CN111753083A - Complaint report text classification method based on SVM parameter optimization - Google Patents

Complaint report text classification method based on SVM parameter optimization Download PDF

Info

Publication number
CN111753083A
CN111753083A (application CN202010389257.6A)
Authority
CN
China
Prior art keywords
individual
individuals
text
svm
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010389257.6A
Other languages
Chinese (zh)
Inventor
范青武
陈�光
杨凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010389257.6A priority Critical patent/CN111753083A/en
Publication of CN111753083A publication Critical patent/CN111753083A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3335: Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]


Abstract

The invention discloses a method for automatically classifying complaint report texts, aiming to improve classification precision and staff efficiency. The method comprises the following steps: obtaining a number of complaint report texts with category labels and dividing them into a training set and a test set; performing word segmentation on the texts and removing stop words; constructing a text model and performing feature extraction and dimension reduction on it; training a support vector machine (SVM) with the training set text model; dynamically optimizing the SVM parameters with an improved fruit fly optimization algorithm (IFOA), using the classification accuracy on the test set to obtain the optimal parameter values; and preprocessing the complaint report texts to be classified and inputting them into the parameter-optimized SVM, which then classifies them automatically. The method is applicable to the automatic classification of various complaint report texts, achieves high classification precision, and addresses the low precision and low efficiency of manual classification.

Description

Complaint report text classification method based on SVM parameter optimization
Technical Field
The invention relates to the technical field of natural language processing, in particular to a complaint report text classification method based on SVM parameter optimization.
Background
Complaint reporting is one of the best ways to realize democratic management, public participation and public supervision. For government departments, it can fully harness the strength of the people and improve working efficiency; for enterprises, it can truly and objectively reflect the opinions and attitudes of the groups they serve.
At present, most complaint reporting platforms are built on the internet, and with the rapid development of network technology, working online is convenient and fast, so the number of reports has increased greatly. To process this massive amount of information more efficiently, staff classify it according to certain rules and forward it to the corresponding departments. However, manually classifying complaint report texts consumes a great deal of time and cost, and because of subjective differences among workers, differing service levels and other factors, the understanding of a given matter can deviate, causing classification errors that directly affect subsequent work.
Complaint report texts have many categories, irregular expression and inconspicuous features. Traditional text classification methods, such as text clustering (TC) and topic models (e.g., LDA), cannot achieve accurate classification, and artificial neural networks are prone to overfitting and generalize poorly. An algorithm with strong classification and generalization capability is therefore required. The support vector machine (SVM) is a classification algorithm based on statistical learning. It is suited to small-sample, high-dimensional and nonlinear classification problems, and after a kernel function is introduced it can map a low-dimensional space to a high-dimensional one to achieve accurate classification; it is strongly adaptable and fits the practical problem addressed by the invention. However, the accuracy of SVM classification is closely tied to its parameter values, so choosing suitable parameters is essential.
For SVM parameter selection, manual tuning is time-consuming, labor-intensive and imprecise, so applying an optimization algorithm is the preferred approach. The fruit fly optimization algorithm (FOA), proposed by Pan in imitation of the foraging behavior of fruit flies, has stronger optimization capability than the genetic algorithm (GA) and particle swarm optimization (PSO). However, FOA also has shortcomings, such as a fixed search range and low population diversity, which can make it converge to a local optimum on complex problems. FOA therefore needs to be improved to strengthen its optimization capability.
Disclosure of Invention
Aiming at the difficulty of classifying complaint report texts, the invention provides a text classification method that dynamically optimizes the parameters of a support vector machine (SVM) with an improved fruit fly optimization algorithm (IFOA). The SVM has good classification and generalization capability, and the IFOA has stronger optimization capability than the FOA and can locate the optimal SVM parameters more accurately, thereby improving classification accuracy.
The technical scheme of the invention comprises the following steps:
step 1. text acquisition and preprocessing
Step 1.1: obtain a certain number of complaint report texts and their categories, ensure the category labels are accurate, and then divide the texts into a training set and a test set in a certain proportion.
Step 1.2 word segmentation and stop word removal are performed on all texts using the jieba toolkit of python.
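The stop-word-removal part of step 1.2 can be sketched in a few lines of Python. This is a minimal illustration only: in practice the token list would come from the jieba toolkit (e.g. `jieba.lcut(text)`), which is assumed here rather than shown, and the sample tokens and stop-word list are made up for the example.

```python
def remove_stopwords(tokens, stopwords):
    """Step 1.2: drop stop words from a segmented token list."""
    return [t for t in tokens if t not in stopwords]

# Stand-in for jieba output (jieba.lcut would produce such a list for Chinese text).
tokens = ["the", "factory", "discharges", "waste", "into", "the", "river"]
stopwords = {"the", "into", "a", "of"}
print(remove_stopwords(tokens, stopwords))  # ['factory', 'discharges', 'waste', 'river']
```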
Step 1.3: model the segmented, stop-word-free texts with the vector space model (VSM), expressing each text as a space vector:
D_i = D(t_1, w_1; t_2, w_2; …; t_n, w_n),  i ∈ N+, n ∈ N+   (1)
In the above formula, D_i is the space vector of a text, i is the index of the text, t is the sub-vector corresponding to a word of the text, w is the weight of that sub-vector, and n is the index of the last sub-vector.
Step 1.4: use the term frequency-inverse document frequency (TF-IDF) algorithm to perform feature extraction and dimension reduction on the text space vectors, as follows:
TF-IDF jointly considers the frequency of a word within a single document and the spread of the word across the whole text set, giving higher statistical precision. Its calculation formula is:
P_i = tf_ij × idf_i   (2)
In the above formula, P_i is the comprehensive frequency of a word, tf_ij is the frequency of word i in document j, idf_i reflects the proportion of documents in the text set that contain word i, and i and j are the indices of the word and the document. If the P_i of a word is below a threshold, its comprehensive frequency is considered low and the word is removed.
Because each text has been converted into a space vector, the comprehensive frequency of a word is the comprehensive frequency of its corresponding sub-vector, so removing low-frequency sub-vectors reduces the dimension. The vector D_i of step 1.3 can, after dimension reduction, be expressed as:
D'_i = D(t_1, w_1; t_2, w_2; …; t_k, w_k),  i ∈ N+, k ∈ N+, k < n   (3)
In the above formula, the dimension k of D'_i is smaller than the dimension n of D_i.
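As a rough sketch, the TF-IDF scoring and pruning of step 1.4 can be implemented as follows. The log-based idf is an assumption (the text only says idf reflects the proportion of documents containing the word; the logarithmic form is the common convention), and the threshold and sample documents are illustrative.

```python
import math

def tfidf(docs):
    """Equation (2): P_i = tf_ij * idf_i, computed for every word of every document."""
    n_docs = len(docs)
    df = {}                                   # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    scores = []
    for doc in docs:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

def reduce_dim(score_dict, threshold):
    """Equation (3): drop sub-vectors whose comprehensive frequency is below the threshold."""
    return {w: s for w, s in score_dict.items() if s >= threshold}

docs = [["pollution", "river", "factory"], ["pollution", "noise"]]
scores = tfidf(docs)
# "pollution" appears in every document, so its idf (and hence its score) is 0
# and pruning removes it, shrinking the vector's dimension.
print(reduce_dim(scores[0], 0.01))
```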
Step 2.SVM training
The general form of the linear discriminant function is:
g(x) = w^T x + b = 0   (4)
After normalization, all training samples satisfy the constraint:
y_i [w · x_i + b] − 1 ≥ 0,  i = 1, 2, …, n   (5)
In the above formula, w is the normal vector of the classification plane, b is the offset, and (x_i, y_i) are the training samples with their labels.
Under the above constraint, finding the optimal classification surface is equivalent to minimizing the following function:
Φ(w) = (1/2)‖w‖^2   (7)
Therefore, the Lagrange function is introduced:
L(w, b, α) = (1/2)‖w‖^2 − Σ_i α_i ( y_i (w · x_i + b) − 1 )   (8)
The problem is converted into its dual: minimizing the Lagrange function with respect to w and b, i.e., a convex quadratic program. By solving this dual problem, the classification surface function is finally obtained:
f(x) = sgn( Σ_i α_i* y_i (x_i · x) + b* )   (9)
as the Gaussian Radial Basis Function (RBF) has stronger mapping capability, the invention selects the RBF as the kernel function of the SVM:
Figure BDA0002485155840000044
in the above equation, σ is a width parameter of RBF, and represents control of the radial range. If the sigma is smaller than the minimum distance between all training samples, all samples are support vectors and can be classified correctly; otherwise, all samples would be classified as a class, rendering them incapable of learning.
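Equations (9) and (10) can be sketched together in a few lines of Python. The support vectors, multipliers and bias below are hypothetical stand-ins; in the actual method they would come from solving the dual problem on the training set.

```python
import math

def rbf_kernel(x, y, sigma):
    """Equation (10): K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision(x, support_vectors, labels, alphas, b, sigma):
    """Equation (9) with the kernel substituted for the inner product:
    f(x) = sgn(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * rbf_kernel(sv, x, sigma)
            for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Hypothetical two-support-vector model: one positive SV at the origin,
# one negative SV at (2, 2).
svs, ys, alphas = [(0.0, 0.0), (2.0, 2.0)], [1, -1], [1.0, 1.0]
print(decision((0.1, 0.1), svs, ys, alphas, b=0.0, sigma=1.0))  # near the positive SV
```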
After the kernel function is introduced, nonlinear classification becomes possible, but some linearly inseparable cases still exist in the transformed sample space, where the linear discriminant function is difficult to satisfy. To give the classifier some fault tolerance, slack variables ξ_i are introduced, so that the condition becomes:
y_i [w · x_i + b] − 1 + ξ_i ≥ 0,  i = 1, 2, …, n   (11)
To control the overall number of errors and guarantee classification precision, a penalty factor C is introduced, turning the objective under the above constraint into:
min  (1/2)‖w‖^2 + C Σ_i ξ_i   (12)
In the above formula, the larger C is, the more heavily misclassification is penalized, but the generalization ability of the classifier decreases; the smaller C is, the lower the classification accuracy.
Therefore, the SVM training steps are as follows:
and 2.1, determining the values of C and sigma.
And 2.2, inputting the preprocessed training set text into the SVM for training, and substituting C and sigma.
Step 3, SVM parameter optimization
FOA is an optimization algorithm based on the foraging principle of fruit flies. Aiming at the shortcomings of FOA, the invention proposes the IFOA optimization algorithm, whose calculation steps are as follows:
(1) Initialize the parameters of the algorithm: the maximum number of iterations g_max, the population size p, the initial search radius R, and the initial position coordinate X of each fruit fly individual:
X = (Random − 0.5) · π   (13)
In the above formula, Random is a random number in (0, 1), and X is the position coordinate value of an individual.
(2) Calculating the taste concentration judgment value of all drosophila individuals:
S=tan(X) (14)
in the above formula, S is a taste concentration judgment value of an individual.
(3) Substitute the taste concentration judgment values of all fruit fly individuals into the objective function (the problem to be optimized) in turn to obtain the fitness value of each individual; select the individuals with the maximum and minimum fitness values, i.e., the optimal and worst individuals, and record their positions and fitness values:
fitness_n = f(S_n),  n = 1, 2, …, p   (15)
[bestfitness, bestlocation] = max(fitness)   (16)
[worstfitness, worstlocation] = min(fitness)   (17)
In the above formula, n is the individual index, fitness is the set of fitness values of all individuals, f(·) is the objective function, bestfitness is the maximum fitness value, bestlocation is the position of the optimal individual, worstfitness is the minimum fitness value, and worstlocation is the position of the worst individual.
(4) Calculate the distance between each fruit fly individual and the optimal individual and the worst individual. If an individual is closer to the optimal individual than to the worst individual, it is assigned to the subgroup with stronger search capability; otherwise it is assigned to the subgroup with weaker search capability:
distance_best = |X − X_bestlocation|   (18)
distance_worst = |X − X_worstlocation|   (19)
In the above formula, distance_best is the distance between an individual and the optimal individual, X_bestlocation is the position of the optimal individual, distance_worst is the distance between an individual and the worst individual, and X_worstlocation is the position of the worst individual.
(5) The subgroups with stronger and weaker search capability each search under the guidance of the optimal individual with their respective radii, and the positions are updated:
X_best = X_bestlocation + R_best · (Random − 0.5) · π   (20)
X_worst = X_bestlocation + R_worst · (Random − 0.5) · π   (21)
Wherein:
[Equations (22) and (23), present in the original only as images, define the search radii R_best and R_worst as functions of the current iteration number g_i, the fitness ratio fitness_i / fitness_{i+1}, and the constants m and n.]
In the above formula, X_best is the position coordinate of an individual in the stronger subgroup, R_best is the search radius of individuals in the stronger subgroup, X_worst is the position coordinate of an individual in the weaker subgroup, R_worst is the search radius of individuals in the weaker subgroup, g_i is the current iteration number, fitness_i is the fitness value of the current individual, fitness_{i+1} is the fitness value of the previous generation, and m and n are constants.
(6) After the positions are updated, calculate the taste concentration judgment values and fitness values of all fruit fly individuals and record the positions and fitness values of the new optimal and worst individuals. If the new optimal fitness value is lower than that of the previous generation, the previous generation's optimal position is retained.
(7) Enter the iterative process of the algorithm, repeating steps (4) to (6). When the maximum number of iterations is reached, the algorithm ends and outputs the taste concentration judgment value of the optimal individual of the last generation, i.e., the optimal solution of the objective function.
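Steps (1) to (7) can be sketched on a toy objective as follows. Because the radius-update formulas (22) and (23) appear only as images in the source, fixed radii `r_best` and `r_worst` stand in for them here as a loudly flagged assumption; everything else (position initialization, S = tan(X), best/worst selection, distance-based subgrouping, elitist retention of the previous best) follows the steps above.

```python
import math
import random

def ifoa(objective, pop=20, iters=60, r_best=0.5, r_worst=1.0, seed=1):
    rng = random.Random(seed)
    # Step (1): initial positions X = (Random - 0.5) * pi, equation (13).
    xs = [(rng.random() - 0.5) * math.pi for _ in range(pop)]
    best_x, best_fit = xs[0], -float("inf")
    for _ in range(iters):
        # Step (2): taste concentration judgment values S = tan(X), equation (14).
        fits = [objective(math.tan(x)) for x in xs]
        # Step (3): optimal and worst individuals of this generation.
        hi = max(range(pop), key=fits.__getitem__)
        lo = min(range(pop), key=fits.__getitem__)
        if fits[hi] > best_fit:  # Step (6): retain the previous best if it was better.
            best_fit, best_x = fits[hi], xs[hi]
        # Steps (4)-(5): subgroup by distance to best/worst, then search around
        # the best-known position with the subgroup's radius (stand-in radii).
        xs = [best_x + (r_best if abs(x - xs[hi]) < abs(x - xs[lo]) else r_worst)
                       * (rng.random() - 0.5) * math.pi
              for x in xs]
    # Step (7): the best S found is the optimal solution of the objective.
    return math.tan(best_x), best_fit

s, f = ifoa(lambda s: -(s - 2.0) ** 2)  # toy objective, maximum at s = 2
print(round(s, 2), round(f, 4))
```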
Thus, the steps of optimizing parameters of the SVM by IFOA are as follows:
step 3.1 initialize IFOA parameters including population size, maximum iteration number, initial search radius, initial position coordinates of fruit fly and values of m and n.
Step 3.2: calculate the taste concentration judgment value of every fruit fly individual; it is likewise a two-dimensional array.
Step 3.3: take the two elements of each individual's taste concentration judgment value as C and σ, input them into the SVM in turn, train with the preprocessed training set texts, and test the SVM's classification on the preprocessed test set texts. The objective function of the IFOA thus becomes a classification accuracy function with C and σ as arguments, P_precision(C, σ). The classification accuracy computed by P_precision serves as each individual's fitness value; the individuals with the maximum and minimum fitness values, i.e., the optimal and worst individuals, are selected and their positions and fitness values recorded.
Step 3.4: calculate the distance between each fruit fly individual and the optimal individual and the worst individual. If an individual is closer to the optimal individual than to the worst individual, assign it to the subgroup with stronger search capability; otherwise assign it to the subgroup with weaker search capability.
Step 3.5: the subgroups with stronger and weaker search capability each search under the guidance of the optimal individual with their respective radii, and the positions are updated.
Step 3.6: after the positions are updated, calculate the taste concentration judgment values of all fruit fly individuals, input them in turn into the SVM as C and σ, train and test, and take the resulting classification accuracy as the new fitness value of each individual. Then record the positions and fitness values of the new optimal and worst individuals; if the new optimal fitness value is lower than that of the previous generation, retain the previous generation's optimal position.
Step 3.7: enter the iterative process of the algorithm, repeating steps 3.4 to 3.6. When the maximum number of iterations is reached, the algorithm ends and outputs the taste concentration judgment value of the optimal individual of the last generation as the optimal parameters of the SVM.
Step 4. model usage
Step 4.1: preprocess the complaint report texts to be classified; the preprocessing steps are the same as steps 1.2 to 1.4.
Step 4.2: input the preprocessed complaint report texts into the parameter-optimized SVM.
Step 4.3: obtain the output of the SVM, i.e., the category to which each complaint report text belongs.
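Step 4 as a whole amounts to composing the earlier stages into one pipeline. The sketch below uses hypothetical callables (`tokenize`, `vectorize`, `svm_predict`) standing in for jieba-based preprocessing, the TF-IDF space-vector model and the parameter-optimized SVM; the stubs in the usage lines exist only to keep the example self-contained.

```python
def classify_complaints(texts, tokenize, vectorize, svm_predict):
    """Steps 4.1-4.3: preprocess each complaint text as in steps 1.2-1.4,
    feed it to the parameter-optimized SVM, and collect the categories."""
    results = []
    for text in texts:
        tokens = tokenize(text)              # step 1.2: segmentation + stop words
        vector = vectorize(tokens)           # steps 1.3-1.4: VSM + TF-IDF
        results.append(svm_predict(vector))  # steps 4.2-4.3: SVM category
    return results

# Stub pipeline: real components would be jieba, a TF-IDF vectorizer and the trained SVM.
labels = classify_complaints(
    ["factory smoke at night", "loud music"],
    tokenize=str.split,
    vectorize=len,
    svm_predict=lambda v: "air pollution" if v > 2 else "noise",
)
print(labels)  # ['air pollution', 'noise']
```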
Advantageous effects
The invention combines the IFOA with the SVM and dynamically optimizes the SVM parameters using the strong optimization capability of the IFOA, further improving classification precision and adaptability; it is well suited to the complex problem of classifying complaint report texts.
Drawings
FIG. 1 is a diagram of the location of a text space vector in coordinates.
Fig. 2 is an optimal hyperplane for an SVM.
FIG. 3 is an input space mapping for an SVM.
FIG. 4 is a schematic representation of foraging principle of a Drosophila population.
FIG. 5 shows the calculation method of the function P_precision.
Fig. 6 is an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to FIG. 6. The examples illustrate the invention without limiting its scope of use; any modification within the scope of the claims falls within the protection scope of the invention.
In this embodiment, environmental pollution complaint report texts are taken as the research object, a certain amount of valid data is obtained from a report platform, and the method is applied as follows:
step 1. text acquisition and preprocessing
Step 1.1: obtain a certain number of complaint report texts and their categories, ensure the category labels are accurate, and then divide the texts into a training set and a test set in a certain proportion.
Step 1.2 word segmentation and stop word removal are performed on all texts using the jieba toolkit of python.
Step 1.3: model the segmented, stop-word-free texts with the vector space model (VSM), expressing each text as a space vector:
D_i = D(t_1, w_1; t_2, w_2; …; t_n, w_n),  i ∈ N+, n ∈ N+   (1)
In the above formula, D_i is the space vector of a text, i is the index of the text, t is the sub-vector corresponding to a word of the text, w is the weight of that sub-vector, and n is the index of the last sub-vector.
If (t_1, t_2, …, t_n) is viewed as an n-dimensional coordinate system and the weights (w_1, w_2, …, w_n) as the corresponding coordinates, the position of the text's space vector in this coordinate system is as shown in FIG. 1.
Step 1.4: use the term frequency-inverse document frequency (TF-IDF) algorithm to perform feature extraction and dimension reduction on the text space vectors, as follows:
TF-IDF jointly considers the frequency of a word within a single document and the spread of the word across the whole text set, giving higher statistical precision. Its calculation formula is:
P_i = tf_ij × idf_i   (2)
In the above formula, P_i is the comprehensive frequency of a word, tf_ij is the frequency of word i in document j, idf_i reflects the proportion of documents in the text set that contain word i, and i and j are the indices of the word and the document. If the P_i of a word is below a threshold, its comprehensive frequency is considered low and the word is removed.
Because each text has been converted into a space vector, the comprehensive frequency of a word is the comprehensive frequency of its corresponding sub-vector, so removing low-frequency sub-vectors reduces the dimension. The vector D_i of step 1.3 can, after dimension reduction, be expressed as:
D'_i = D(t_1, w_1; t_2, w_2; …; t_k, w_k),  i ∈ N+, k ∈ N+, k < n   (3)
In the above formula, the dimension k of D'_i is smaller than the dimension n of D_i.
Step 2.SVM training
The SVM is a machine learning classification algorithm based on an optimal hyperplane theory, wherein the optimal hyperplane is shown in FIG. 2.
The general form of the linear discriminant function is:
g(x) = w^T x + b = 0   (4)
After normalization, all training samples satisfy the constraint:
y_i [w · x_i + b] − 1 ≥ 0,  i = 1, 2, …, n   (5)
In the above formula, w is the normal vector of the classification plane, b is the offset, and (x_i, y_i) are the training samples with their labels.
Under the above constraint, finding the optimal classification surface is equivalent to minimizing the following function:
Φ(w) = (1/2)‖w‖^2   (7)
Therefore, the Lagrange function is introduced:
L(w, b, α) = (1/2)‖w‖^2 − Σ_i α_i ( y_i (w · x_i + b) − 1 )   (8)
The problem is converted into its dual: minimizing the Lagrange function with respect to w and b, i.e., a convex quadratic program. By solving this dual problem, the classification surface function is finally obtained:
f(x) = sgn( Σ_i α_i* y_i (x_i · x) + b* )   (9)
at this point, a kernel function may be introduced to map the input space into a high-dimensional Hilbert space, as shown in fig. 3.
Because the Gaussian radial basis function (RBF) has strong mapping capability, the invention selects the RBF as the kernel function of the SVM:
K(x_i, x_j) = exp( −‖x_i − x_j‖^2 / (2σ^2) )   (10)
In the above formula, σ is the width parameter of the RBF and controls its radial range. If σ is smaller than the minimum distance between training samples, every sample becomes a support vector and can be classified correctly; if σ is too large, all samples are assigned to a single class and the model cannot learn.
After the kernel function is introduced, nonlinear classification becomes possible, but some linearly inseparable cases still exist in the transformed sample space, where the linear discriminant function is difficult to satisfy. To give the classifier some fault tolerance, slack variables ξ_i are introduced, so that the condition becomes:
y_i [w · x_i + b] − 1 + ξ_i ≥ 0,  i = 1, 2, …, n   (11)
To control the overall number of errors and guarantee classification precision, a penalty factor C is introduced, turning the objective under the above constraint into:
min  (1/2)‖w‖^2 + C Σ_i ξ_i   (12)
In the above formula, the larger C is, the more heavily misclassification is penalized, but the generalization ability of the classifier decreases; the smaller C is, the lower the classification accuracy.
Therefore, the SVM training steps are as follows:
and 2.1, determining the values of C and sigma.
And 2.2, inputting the preprocessed training set text into the SVM for training, and substituting C and sigma.
2.1, determining the values of C and sigma.
2.2 inputting the preprocessed training set text into the SVM for training, and substituting C and sigma.
Step 3, SVM parameter optimization
The FOA is an optimization algorithm based on the foraging principle of fruit flies, as shown in FIG. 4. Aiming at the shortcomings of FOA, the invention proposes the IFOA optimization algorithm, whose calculation steps are as follows:
Step 3.1: initialize the IFOA parameters: the population size p is set to 20, the maximum number of iterations g_max is set, the initial search radius R is set to 2, the initial phase angle coordinate X of each fruit fly is a two-dimensional array whose elements lie in [−π/4, π/4], m is set to 16, and n is set to 32.
X = (Random − 0.5) · (π/2)   (13)
In the above formula, Random is a random number in (0, 1) drawn independently for each element, and X is the position coordinate array of an individual; the factor π/2 confines each element to [−π/4, π/4].
Step 3.2 calculate the taste concentration decision value for all individual drosophila species, which is also a two-dimensional array.
S = tan(X)   (14)
In the above formula, S is the taste concentration judgment array of an individual, with tan applied element-wise.
Step 3.3: take the two elements of each individual's taste concentration judgment value as C and σ, input them into the SVM in turn, train with the preprocessed training set texts, and test the SVM's classification on the preprocessed test set texts. The objective function of the IFOA thus becomes a classification accuracy function with C and σ as arguments, P_precision(C, σ); its specific calculation method is shown in FIG. 5. The classification accuracy computed by P_precision serves as each individual's fitness value; the individuals with the maximum and minimum fitness values, i.e., the optimal and worst individuals, are selected and their positions and fitness values recorded:
fitness_n = P_precision(C, σ),  n = 1, 2, …, p   (15)
[bestfitness, bestlocation] = max(fitness)   (16)
[worstfitness, worstlocation] = min(fitness)   (17)
In the above formula, n is the individual index, fitness is the set of fitness values of all individuals, bestfitness is the maximum fitness value, bestlocation is the position of the optimal individual, worstfitness is the minimum fitness value, and worstlocation is the position of the worst individual.
Step 3.4: calculate the distance between each fruit fly individual and the optimal individual and the worst individual. If an individual is closer to the optimal individual than to the worst individual, assign it to the subgroup with stronger search capability; otherwise assign it to the subgroup with weaker search capability:
distance_best = |X − X_bestlocation|   (18)
distance_worst = |X − X_worstlocation|   (19)
In the above formula, distance_best is the distance between an individual and the optimal individual, X_bestlocation is the position of the optimal individual, distance_worst is the distance between an individual and the worst individual, and X_worstlocation is the position of the worst individual.
Step 3.5: the subgroups with stronger and weaker search capability each search under the guidance of the optimal individual with their respective radii, and the positions are updated:
X_best = X_bestlocation + R_best · (Random − 0.5) · π   (20)
X_worst = X_bestlocation + R_worst · (Random − 0.5) · π   (21)
Wherein:
[Equations (22) and (23), present in the original only as images, define the search radii R_best and R_worst as functions of the current iteration number g_i, the fitness ratio fitness_i / fitness_{i+1}, and the constants m and n.]
In the above formula, X_best is the position coordinate of an individual in the stronger subgroup, R_best is the search radius of individuals in the stronger subgroup, X_worst is the position coordinate of an individual in the weaker subgroup, R_worst is the search radius of individuals in the weaker subgroup, g_i is the current iteration number, fitness_i is the fitness value of the current individual, fitness_{i+1} is the fitness value of the previous generation, and m and n are constants.
Step 3.6: after the position update, calculate the taste concentration judgment values of all fruit fly individuals, input them into the SVM in turn as C and σ, train and test, and take the classification accuracy as the new individual fitness values. Then record the positions and fitness values of the new optimal and worst individuals; if the fitness value of the new optimal individual is smaller than that of the previous generation, the optimal individual still retains the position of the previous generation.
Step 3.7: enter the iterative loop of the algorithm and repeat steps 3.4 to 3.6. When the maximum number of iterations is reached, the algorithm terminates and outputs the taste concentration judgment value of the optimal individual of the final generation as the optimal parameters of the SVM.
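Steps 3.1–3.7 can be condensed into a runnable sketch (a hedged reimplementation on a toy dataset, not the patented code: scikit-learn's SVC stands in for the SVM, σ is mapped to gamma = 1/(2σ²), the Iris data replaces the complaint texts, and fixed radii replace the image-only formulas (22)–(23)):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_tr, X_te, y_tr, y_te = train_test_split(*load_iris(return_X_y=True),
                                          random_state=0)

def accuracy(params):
    """Fitness: test accuracy of an RBF-SVM whose (C, sigma) come from
    the individual's taste-concentration value S = tan(X) (Eq. (14))."""
    C, sigma = np.abs(np.tan(params)) + 1e-3          # keep both positive
    clf = SVC(C=C, gamma=1.0 / (2.0 * sigma ** 2), max_iter=10_000)
    return clf.fit(X_tr, y_tr).score(X_te, y_te)

pop, iters, r_best, r_worst = 10, 20, 0.05, 0.2       # step 3.1: init
positions = (rng.random((pop, 2)) - 0.5) * np.pi      # Eq. (13)
fit = np.array([accuracy(p) for p in positions])      # steps 3.2-3.3
best_i = fit.argmax()
best_pos, best_fit = positions[best_i].copy(), fit[best_i]

for _ in range(iters):                                # step 3.7: iterate
    worst_pos = positions[fit.argmin()]
    d_best = np.linalg.norm(positions - best_pos, axis=1)
    d_worst = np.linalg.norm(positions - worst_pos, axis=1)
    radii = np.where(d_best < d_worst, r_best, r_worst)   # step 3.4: split
    positions = best_pos + radii[:, None] * (rng.random((pop, 2)) - 0.5) * np.pi
    fit = np.array([accuracy(p) for p in positions])      # step 3.6
    if fit.max() > best_fit:                              # elitism
        best_fit, best_pos = fit.max(), positions[fit.argmax()].copy()

C_opt, sigma_opt = np.abs(np.tan(best_pos)) + 1e-3
print(f"best accuracy {best_fit:.3f} with C={C_opt:.3f}, sigma={sigma_opt:.3f}")
```

The elitism test in the loop mirrors step 3.6: the best-known position is only replaced when a strictly better fitness value appears.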
Step 4. model usage
Step 4.1: preprocess the complaint report text to be classified; the preprocessing steps are the same as steps 1.2 to 1.4.
Step 4.2: input the preprocessed complaint report text into the parameter-optimized SVM.
Step 4.3: obtain the output of the SVM, i.e. the category to which the complaint report text belongs.
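Steps 4.1–4.3 amount to running new text through the same preprocessing and the tuned SVM. A minimal sketch with scikit-learn follows; the toy corpus, labels, and the C and σ values are all hypothetical, and the jieba segmentation of step 1.2 is omitted because the toy texts are already space-separated:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# hypothetical complaint corpus; real Chinese texts would first be segmented
# with jieba (step 1.2) and stop words removed before vectorization
train_texts = ["factory smoke air pollution", "loud noise at night construction",
               "river water smells of sewage", "dust and smoke from chimney",
               "machine noise neighbors complain", "waste water dumped in river"]
train_labels = ["air", "noise", "water", "air", "noise", "water"]

model = make_pipeline(TfidfVectorizer(),        # steps 1.3-1.4: VSM + TF-IDF
                      SVC(C=10.0, gamma=0.5))   # hypothetical tuned C and sigma
model.fit(train_texts, train_labels)

# step 4: preprocess a new complaint and read off its predicted category
prediction = model.predict(["black smoke pollution from a factory chimney"])[0]
print(prediction)
```

In deployment the C and gamma values would come from the IFOA search of step 3 rather than being hard-coded.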
The classification results show that the method achieves a classification accuracy of 91.0% and a recall of 90.3% on environmental pollution complaint report texts, which meets the requirements of practical application.

Claims (1)

1. A complaint report text classification method based on SVM parameter optimization, characterized by comprising the following steps:
step 1: text acquisition and preprocessing:
step 1.1, obtaining a certain number of complaint report texts and categories thereof, ensuring that category labels are accurate, and then dividing the texts into a training set text and a test set text according to a certain proportion;
step 1.2, performing word segmentation and stop word removal on all texts by using a jieba toolkit of python;
step 1.3, modeling the text after word segmentation and stop-word removal with a vector space model (VSM), expressing each text as a space vector of the form:
D_i = D(t_1, w_1; t_2, w_2; …; t_n, w_n), i ∈ N+, n ∈ N+ (1)
in the above formula, D_i represents the space vector of a text, i is the index of the text, t is the sub-vector corresponding to a word in the text, w is the weight of that sub-vector, and n is the index of the sub-vector;
step 1.4, using the term frequency-inverse document frequency (TF-IDF) algorithm to perform feature extraction and dimension reduction on the text space vectors, as follows:
the TF-IDF comprehensively considers the frequency of a single word appearing in a single document and the total frequency of the word appearing in a text set, has higher statistical precision, and has the following calculation formula:
P_i = tf_ij × idf_i (2)
in the above formula, P_i represents the comprehensive frequency of a word, tf_ij is the frequency with which the word appears in a document, idf_i reflects the proportion of documents in the whole text set that contain the word, i is the index of the word and j the index of the document; if P_i of a word is below a certain threshold, its comprehensive frequency is considered low and the word is removed;
since the text has been converted into space-vector form, the comprehensive frequency of a word is the comprehensive frequency of its corresponding sub-vector within the whole space vector, so dimension reduction is achieved by removing the low-frequency sub-vectors; D_i from step 1.3 can then be expressed as:
D′_i = D(t_1, w_1; t_2, w_2; …; t_k, w_k), i ∈ N+, k ∈ N+ and k < n (3)
in the above formula, the dimension of D′_i is smaller than that of D_i;
Step 2. SVM training
The general form of a linear discriminant function is:
g(x) = w^T·x + b = 0 (4)
the points that satisfy the linear discriminant function obey the constraint:
y_i[w·x_i + b] − 1 ≥ 0, i = 1, 2, …, n (5)
in the above formula, w is the normal vector of the classification plane, b is the offset, and x_i, y_i are the training samples and their class labels;
the optimal classification surface can also be expressed as the minimum of the following function under the above constraint:
min Φ(w) = (1/2)‖w‖² (7)
therefore, the Lagrange function is introduced:
L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{n} α_i{ y_i[w·x_i + b] − 1 } (8)
the problem is thus converted into its dual: minimizing the Lagrange function with respect to w and b, i.e. a convex quadratic program; by solving this dual problem, the classification surface function is finally obtained:
f(x) = sgn( Σ_{i=1}^{n} α_i* y_i (x_i·x) + b* ) (9)
as the Gaussian Radial Basis Function (RBF) has stronger mapping capability, the invention selects the RBF as the kernel function of the SVM:
K(x, x_i) = exp( −‖x − x_i‖² / (2σ²) ) (10)
in the above formula, σ is the width parameter of the RBF and controls its radial range; if σ is smaller than the minimum distance between training samples, every sample becomes a support vector and all samples are classified correctly; conversely, if σ is too large, all samples are classified into one class and learning ability is lost;
after the kernel function is introduced, the nonlinear classification problem is solved, but some samples in the transformed sample space remain linearly inseparable and can hardly satisfy the linear discriminant function; therefore, to give the classifier a degree of fault tolerance, a slack variable ξ_i is introduced so that the constraint becomes:
y_i[w·x_i + b] − 1 + ξ_i ≥ 0, i = 1, 2, …, n (11)
to control the overall number of errors and guarantee classification precision, a penalty factor C is introduced, turning the objective under the above constraint into:
min Φ(w, ξ) = (1/2)‖w‖² + C·Σ_{i=1}^{n} ξ_i (12)
in the above formula, the larger C is, the heavier the penalty for misclassification but the weaker the generalization ability of the classifier; conversely, the smaller C is, the lower the classification precision;
therefore, the SVM training steps are as follows:
step 2.1, determining the value sizes of C and sigma;
2.2, inputting the preprocessed training set text into the SVM for training, and substituting C and sigma;
step 3, SVM parameter optimization
FOA is an optimization algorithm based on the foraging behavior of fruit flies; to address the shortcomings of FOA, the invention proposes the IFOA optimization algorithm, whose calculation steps are as follows:
(1) initializing the parameters of the algorithm, i.e. the maximum number of iterations g_max, the population size p, the initial search radius R and the initial position coordinates X of the fruit fly individuals;
X=(Random-0.5)·π (13)
in the above formula, Random is a random number in (0, 1), and X is the position coordinate value of an individual;
(2) calculating the taste concentration judgment value of all drosophila individuals:
S=tan(X) (14)
in the above formula, S is a taste concentration determination value of an individual;
(3) substituting the taste concentration judgment values of all fruit fly individuals in turn into the objective function (the problem to be optimized) to obtain the fitness value of each individual, selecting the individuals with the maximum and minimum fitness values, i.e. the optimal individual and the worst individual, and recording their positions and fitness values:
fitness = f(S_n), n = 1, 2, …, p (15)
[bestfitness, bestlocation] = max(fitness) (16)
[worstfitness, worstlocation] = min(fitness) (17)
in the above formula, n is the individual index, fitness is the set of fitness values of all individuals, f(x) is the objective function, bestfitness is the maximum fitness value, bestlocation is the position of the optimal individual, worstfitness is the minimum fitness value, and worstlocation is the position of the worst individual;
(4) calculating the distance from every fruit fly individual to the optimal individual and to the worst individual; if an individual is closer to the optimal individual than to the worst individual, assigning it to the subgroup with stronger search ability, and otherwise assigning it to the subgroup with weaker search ability;
distance_best = ‖X − X_bestlocation‖ (18)
distance_worst = ‖X − X_worstlocation‖ (19)
in the above formulas, distance_best is the distance between an individual and the optimal individual, X_bestlocation is the position of the optimal individual, distance_worst is the distance between an individual and the worst individual, and X_worstlocation is the position of the worst individual;
(5) the subgroups with stronger and weaker search ability each searching under the guidance of the optimal individual with their respective radii, and updating their positions;
X_best = X_bestlocation + R_best·(Random − 0.5)·π (20)
X_worst = X_bestlocation + R_worst·(Random − 0.5)·π (21)
wherein:
[Equations (22) and (23), given only as images in the source: the adaptive search radii R_best and R_worst, expressed in terms of the current iteration number g_i, the fitness values fitness_i and fitness_{i+1}, and the constants m and n]
in the above formulas, X_best is the updated position coordinate of an individual in the subgroup with stronger search ability and R_best is the search radius of that subgroup; X_worst is the updated position coordinate of an individual in the subgroup with weaker search ability and R_worst is the search radius of that subgroup; g_i is the current iteration number, fitness_i is the fitness value of the current individual, fitness_{i+1} is the fitness value of the previous generation, and m and n are constants;
(6) calculating the taste concentration judgment values and fitness values of all fruit fly individuals after the position update, and recording the positions and fitness values of the new optimal and worst individuals; if the fitness value of the new optimal individual is smaller than that of the previous generation, the optimal individual still retaining the position of the previous generation;
(7) entering the iterative loop of the algorithm and repeating steps (4) to (6); when the maximum number of iterations is reached, the algorithm terminates and outputs the taste concentration judgment value of the optimal individual of the final generation, i.e. the optimal solution of the objective function;
thus, the steps of optimizing parameters of the SVM by IFOA are as follows:
step 3.1, initializing parameters of the IFOA, including population scale, maximum iteration times, initial search radius, initial position coordinates of the fruit flies and values of m and n;
step 3.2, calculating the taste concentration judgment values of all the drosophila individuals, wherein the values are also two-dimensional arrays;
step 3.3, taking the two elements of each fruit fly individual's taste concentration judgment value as C and σ, inputting them into the SVM in turn, training with the preprocessed training set text, and then performing a classification test on the SVM with the preprocessed test set text; at this point the objective function of IFOA is replaced by a classification accuracy function with C and σ as arguments, i.e. P_precision(C, σ); using the function P_precision to calculate the classification accuracy of the SVM as the individual's fitness value, meanwhile selecting the individuals with the maximum and minimum fitness values, i.e. the optimal individual and the worst individual, and recording their positions and fitness values;
step 3.4, calculating the distance from every fruit fly individual to the optimal individual and to the worst individual; if an individual is closer to the optimal individual than to the worst individual, assigning it to the subgroup with stronger search ability, and otherwise assigning it to the subgroup with weaker search ability;
step 3.5, the subgroups with stronger and weaker search ability each searching under the guidance of the optimal individual with their respective radii, and updating their positions;
step 3.6, calculating the taste concentration judgment values of all fruit fly individuals after the position update, inputting them into the SVM in turn as C and σ, training and testing, and calculating the classification accuracy as the new individual fitness values; then recording the positions and fitness values of the new optimal and worst individuals; if the fitness value of the new optimal individual is smaller than that of the previous generation, the optimal individual still retaining the position of the previous generation;
step 3.7, entering the iterative loop of the algorithm and repeating steps 3.4 to 3.6; when the maximum number of iterations is reached, the algorithm terminates and outputs the taste concentration judgment value of the optimal individual of the final generation as the optimal parameters of the SVM;
step 4. model usage
Step 4.1, preprocessing the complaint report text to be classified, wherein the preprocessing steps are the same as 1.2 to 1.4;
step 4.2, inputting the preprocessed complaint report text into the SVM which is optimized by the parameters;
and 4.3, acquiring the output of the SVM, namely the category to which the complaint report text belongs.
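The effect of the RBF width σ described in the claim (every sample becomes a support vector when σ is very small, all samples collapse toward one class when σ is very large) can be seen numerically; this sketch assumes the standard Gaussian form of equation (10):

```python
import numpy as np

def rbf_kernel(x, xi, sigma):
    # Gaussian RBF: K(x, x_i) = exp(-||x - x_i||^2 / (2 * sigma^2))
    diff = np.asarray(x) - np.asarray(xi)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))

a, b = np.zeros(2), np.ones(2)
# very small sigma: distinct points look completely dissimilar,
# so each training sample would act as its own support vector
print(rbf_kernel(a, b, 0.1))    # ~0
# very large sigma: distinct points look nearly identical,
# so the classifier can no longer tell the classes apart
print(rbf_kernel(a, b, 10.0))   # ~1
```

Tuning σ (together with C) between these extremes is exactly what the IFOA search of step 3 automates.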
CN202010389257.6A 2020-05-10 2020-05-10 Complaint report text classification method based on SVM parameter optimization Pending CN111753083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010389257.6A CN111753083A (en) 2020-05-10 2020-05-10 Complaint report text classification method based on SVM parameter optimization


Publications (1)

Publication Number Publication Date
CN111753083A true CN111753083A (en) 2020-10-09

Family

ID=72673364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010389257.6A Pending CN111753083A (en) 2020-05-10 2020-05-10 Complaint report text classification method based on SVM parameter optimization

Country Status (1)

Country Link
CN (1) CN111753083A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330781A (en) * 2017-06-19 2017-11-07 南京信息工程大学 A kind of individual credit risk appraisal procedure based on IFOA SVM
CN108710651A (en) * 2018-05-08 2018-10-26 华南理工大学 A kind of large scale customer complaint data automatic classification method
CN109062180A (en) * 2018-07-25 2018-12-21 国网江苏省电力有限公司检修分公司 A kind of oil-immersed electric reactor method for diagnosing faults based on IFOA optimization SVM model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Qiantu et al.: "Research on SVM Parameter Optimization Based on Improved FOA", Value Engineering, 31 December 2016 (2016-12-31), pages 218-221 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064962A (en) * 2021-03-16 2021-07-02 北京工业大学 Environment complaint reporting event similarity analysis method
CN113064962B (en) * 2021-03-16 2024-03-15 北京工业大学 Environment complaint reporting event similarity analysis method

Similar Documents

Publication Publication Date Title
CN109165294B (en) Short text classification method based on Bayesian classification
Chellapandi et al. Comparison of pre-trained models using transfer learning for detecting plant disease
Lin et al. Tourism demand forecasting: Econometric model based on multivariate adaptive regression splines, artificial neural network and support vector regression
CN110866782A (en) Customer classification method and system and electronic equipment
CN113111924A (en) Electric power customer classification method and device
Tung et al. Binary classification and data analysis for modeling calendar anomalies in financial markets
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
Vo et al. Time series trend analysis based on k-means and support vector machine
Gurrib et al. Bitcoin price forecasting: Linear discriminant analysis with sentiment evaluation
CN111626331B (en) Automatic industry classification device and working method thereof
CN111753083A (en) Complaint report text classification method based on SVM parameter optimization
Teoh et al. Artificial Intelligence in Business Management
CN111507528A (en) Stock long-term trend prediction method based on CNN-LSTM
Kostkina et al. Document categorization based on usage of features reduction with synonyms clustering in weak semantic map
Xiong et al. L-RBF: A customer churn prediction model based on lasso+ RBF
CN114091961A (en) Power enterprise supplier evaluation method based on semi-supervised SVM
Anastasopoulos et al. Computational text analysis for public management research: An annotated application to county budgets
Arsirii et al. Heuristic models and methods for application of the kohonen neural network in the intellectual system of medical-sociological monitoring
CN111626376A (en) Domain adaptation method and system based on discrimination joint probability
CN113570455A (en) Stock recommendation method and device, computer equipment and storage medium
Kalaiselvi et al. Modified Extreme Learning Machine Algorithm with Deterministic Weight Modification for Investment Decisions based on Sentiment Analysis
Ning et al. Manufacturing cost estimation based on similarity
Özarı et al. Forecasting sustainable development level of selected Asian countries using M-EDAS and k-NN algorithm
Tawheed et al. Application of Machine Learning Techniques in the Context of Livestock
CN117828075A (en) Agricultural condition data classification method, agricultural condition data classification device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination