CN107992613A

CN107992613A - A consumer rights protection index analysis method based on machine-learning text mining technology

Info

Publication number: CN107992613A
Application number: CN201711366440.9A
Authority: CN
Inventors: 王晓佳; 蔡文鑫
Original assignee: Guangdong Industry Kaiyuan Science And Technology Co Ltd
Current assignee: Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority date: 2017-12-18
Filing date: 2017-12-18
Publication date: 2018-05-04

Abstract

The present invention provides a consumer rights protection index analysis method based on machine-learning text mining technology, comprising the following steps: step 1) acquire and process data to generate normalized data; step 2) use an extraction module to obtain the normalized data produced by the data acquisition module, and use a data processing module to obtain the sentiment tendency degree; step 3) use a classification module to group and weight-classify the consumer rights protection data after sentiment tendency analysis; step 4) use a processor to calculate the index score ratio; step 5) construct an SVM classifier; step 6) construct and train the classification model; step 7) perform model prediction using the model from step 6). By combining a recent sentiment tendency analysis method with business data, the method yields fast, scientific, standardized, and objective results, provides consumer rights protection index calculation for law enforcement agencies, and gives administrative departments decision support for enforcement.

Description

Consumer rights protection index analysis method based on machine-learning text mining technology

Technical Field

The invention relates to consumer rights protection index calculation, and in particular to a consumer rights protection index analysis method based on machine-learning text mining technology. It is mainly applicable to government stability maintenance monitoring systems, complaint and report data analysis systems, the 12345 government service hotline system, public security systems, industry and commerce bureau systems, and enterprise credit systems.

Background

With rapid socioeconomic development in recent years, the 12345 government service hotline handles part of consumers' consultations, complaints, reports, opinions, and suggestions. These records amount to "investigation and research" delivered by the public, and are big data that the industrial and commercial administration must consult when studying how to strengthen and improve market supervision. Complaints account for more than half of all 12345 work orders. Processing and analyzing the text of complaint work orders and building an enterprise complaint risk early-warning model is an effective and important way to convert market "big data" into market supervision capability.

The 12345 government service hotline center of the industrial and commercial administration directly faces the consuming public, is close to people and markets, and generates a large amount of data every day. This data is real-time feedback from consumers on current market problems. It reflects, promptly and accurately, the changing characteristics and patterns of consumption hotspots, as well as the integrity of market entities' operations, the quality of goods and services, and the fairness of market transactions. It serves as a thermometer and barometer of market supervision quality and as a primary basis for evaluating the market supervision performance of industrial and commercial administrators.

Strengthening the analysis and use of 12345 data, especially the risk-level classification of the enterprises complained about, and letting the data speak is therefore of great significance for making law enforcement by the relevant departments more targeted and timely.

The traditional method of calculating the consumer rights protection index is as follows. A certain period is taken as the base period, and eight key categories of goods and services (clothing, shoes and hats, hairdressing and beauty, household articles, household appliances, computer products, communication equipment, vehicles, agricultural production materials, and online transactions) are selected as monitoring fields. The total amount of money involved in consumer complaints and reports about these goods and services is calculated, and a mathematical model is established: consumer rights protection index = (sum of the amounts involved for the eight categories of goods or services in the current month) ÷ (total amount involved for the same categories of goods or services in the base period) × 100.

This algorithm has two disadvantages. Disadvantage 1: the amount involved is the only indicator, so the result is often dominated by cases with large amounts, and other factors are ignored, such as the case-filing ratio, the number of people involved, the proportion of complainants, and the proportion of complaints against abnormal enterprises. Disadvantage 2: because the data volume is huge, adding further data dimensions makes manual calculation very difficult.
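As a point of comparison, the traditional index described above can be sketched in a few lines of Python. The function name, the category names, and the amounts below are hypothetical and serve only to illustrate the ratio; they do not come from the patent.

```python
def traditional_index(current_amounts, base_amounts):
    """Traditional index: current-month amount involved divided by the
    base-period amount involved for the same categories, times 100."""
    base_total = sum(base_amounts.values())
    if base_total == 0:
        raise ValueError("base-period total must be non-zero")
    # Only categories monitored in the base period are counted.
    current_total = sum(current_amounts.get(cat, 0.0) for cat in base_amounts)
    return current_total / base_total * 100


# Hypothetical amounts involved (in yuan) for three monitored categories.
base = {"household appliances": 50_000.0, "vehicles": 120_000.0,
        "communication equipment": 30_000.0}
month = {"household appliances": 60_000.0, "vehicles": 90_000.0,
         "communication equipment": 45_000.0}
index = traditional_index(month, base)
print(round(index, 2))
```

Because only the amount involved enters the ratio, a single large case can move the index regardless of how many complaints were filed, which is the first disadvantage noted above.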

Disclosure of Invention

In view of the above, the main objective of the present invention is to provide a consumer rights protection index analysis method based on machine-learning text mining technology.

The specific scheme is as follows:

A consumer rights protection index analysis method based on machine-learning text mining technology comprises the following steps:

Step 1) Acquire and process data to generate normalized data:

S1: Acquire the work order text of the 12345 hotline with a data acquisition module. A processor performs initial processing on the acquired work order text, passing the input text data through a word segmentation module and a denoising module to obtain the keywords in each text. The data acquisition module is connected to the processor, and the word segmentation module and the denoising module are arranged in the processor;

S2: Set up a text feature vectorization module that vectorizes and normalizes each keyword, yielding a real-valued matrix built from the vectorization-normalization result of each keyword W in file d;

S3: Set up a text data similarity matching module that, from the real-valued matrix, calculates the cosine similarity between keywords and matches the text data with the closest cosine distance to form a lexicon of similar words;

S4: Set up a normalized data generation module that generates normalized data from the matched text data according to a set unified normalization scheme;

Step 2) Use the extraction module to obtain the normalized data produced by the data acquisition module, and use the data processing module to obtain the sentiment tendency degree;

Step 3) Use a classification module to group and weight-classify the consumer rights protection data after sentiment tendency analysis;

Step 4) Use a processor to calculate the index score ratio;

Step 5) Construct an SVM classifier;

Step 6) Construct and train the classification model;

Step 7) Perform model prediction using the model from step 6).
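The cosine-similarity matching in S3 of step 1) can be illustrated with a minimal Python sketch. The keyword vectors and the helper names `cosine_similarity` and `closest_match` are hypothetical; the patent does not specify how the vectors are produced beyond the vectorization-normalization step.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def closest_match(target, vectors):
    """Return the keyword (other than `target`) most similar to `target`."""
    others = (k for k in vectors if k != target)
    return max(others,
               key=lambda k: cosine_similarity(vectors[target], vectors[k]))


# Hypothetical normalized keyword vectors (rows of the real-valued matrix).
keyword_vectors = {
    "refund": [0.9, 0.1, 0.0],
    "reimbursement": [0.8, 0.2, 0.1],
    "delivery": [0.0, 0.3, 0.9],
}
match = closest_match("refund", keyword_vectors)
print(match)
```

Keywords whose vectors point in nearly the same direction end up in the same group of similar words, independent of vector length.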

Preferably, the specific steps of obtaining the sentiment tendency degree with the data processing module in step 2) are as follows:

S1: Build a dictionary of conjunctions and negation words;

S2: Using this dictionary, extract the conjunctions and negation words from the normalized data and mark their positions in the normalized data;

S3: Match against an existing sentiment dictionary to obtain the polarity and sentiment score of each word;

S4: Determine the weighting of the preceding and following clauses from the position of each conjunction, and determine double negation and polarity inversion of adjacent words from the position of each negation word;

S5: Using the polarity and sentiment score of each word, adjusted for the conjunctions and negation words, accumulate over the normalized data to obtain the sentiment calculation score;

S6: Repeat steps S2 to S5 and determine the sentiment polarity of each text from the sign of its sentiment calculation score;

S7: Formulate the complaint risk level rules: according to the clustering result, complaint risk is divided into five risk levels, namely high risk, general complaint, complaint tendency, and no complaint tendency;

S8: Improve the accuracy of the defined result: the complaint risk level rules must be verified when they are formulated. Input the validation set into the established classification learning model to obtain the complaint risk level of each text in the validation set, compare the result set with the corresponding level data in the validation set, and calculate the accuracy of the model prediction, with predicted labels ∈ {-1, 1}.
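Steps S2 to S5 can be sketched as a small dictionary-based scorer. Everything concrete below is an assumption made for illustration: the English toy dictionaries, the clause weights attached to "but", and the choice to let a negation carry forward to the next sentiment word. The patent does not give these values.

```python
# Hypothetical dictionaries; a real system would load much larger ones.
SENTIMENT = {"good": 1.0, "satisfied": 1.5, "bad": -1.0, "broken": -1.5}
NEGATIONS = {"not", "never"}
# Conjunction -> (weight of preceding clause, weight of following clause).
CONJUNCTION_WEIGHTS = {"but": (0.3, 0.7)}


def clause_score(tokens):
    """Sum sentiment scores, flipping polarity after an odd number of
    negation words (an even number cancels out: double negation)."""
    score, pending_negations = 0.0, 0
    for tok in tokens:
        if tok in NEGATIONS:
            pending_negations += 1
        elif tok in SENTIMENT:
            sign = -1 if pending_negations % 2 else 1
            score += sign * SENTIMENT[tok]
            pending_negations = 0
    return score


def sentiment_score(tokens):
    """Weight the clauses around the first known conjunction, if any."""
    for pos, tok in enumerate(tokens):
        if tok in CONJUNCTION_WEIGHTS:
            w_front, w_rear = CONJUNCTION_WEIGHTS[tok]
            return (w_front * clause_score(tokens[:pos])
                    + w_rear * clause_score(tokens[pos + 1:]))
    return clause_score(tokens)


score = sentiment_score("the service was good but the phone is not good".split())
print(score)
```

In this sketch the clause after "but" is weighted more heavily, so a complaint that opens politely and ends with the actual grievance still receives a negative score.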

Preferably, the specific steps of calculating the index score ratio with the processor in step 4) are as follows:

S1: Suppose the consumer rights protection index has m evaluation fields and n experts take part in the evaluation. Let the rank score set given by expert k be $\{X_i(j)\}^{(k)}$, where $\{X_i(j)\}^{(k)}$ denotes the rank score that expert k (k = 1, 2, …, n) assigns to field i (i = 1, 2, …, m), with value j (j = 1, 2, …, m);

S2: Using the conversion formula, the rank score set can be converted into a cardinal score set $\{B_i(j)\}^{(k)}$, where $\{B_i(j)\}^{(k)}$ denotes the cardinal score corresponding to the field ranked j-th by expert k;

The degree of importance of each field is then calculated using the following formulas:

In formulas (2) and (3), m denotes the number of fields; $S_i$ denotes the score of field i; n denotes the number of experts; $B_i(j)$ denotes the score of field i when it is ranked j-th; and $N_i$ denotes the number of experts who agree that field i is ranked j-th;

S3: Combine the m scores $S_i$ calculated in S2 into an m-dimensional vector $x = [a_1, a_2, \ldots, a_m]$, which represents the scoring of one complaint work order. Calculate the score vectors of all complaint work orders to form the full data set $Z = \{z_n, n \in R\}$;

S4: Label the data in the full data set Z: mark every high-score work order as 1 and every other work order as -1. After labeling, randomly draw 60% of the full data set to form the training data set $S_{train} = \{(z_n, y_n) \mid z_n \in Z,\ y_n \in \{-1, 1\},\ n = 1, 2, 3, \ldots\}$, where $z_n$ is the score ratio vector of the n-th work order and $y_n$ is the corresponding score ratio result; the remaining 40% forms the test data set $S_{test} = \{(z_m, y_m) \mid z_m \in Z,\ y_m \in \{-1, 1\},\ m = 1, 2, 3, \ldots\}$, where $z_m$ is the score ratio vector of the m-th work order and $y_m$ is the corresponding score ratio result.
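The labeling and 60/40 split in S4 can be sketched as follows. The patent does not say what makes a work order "high-score", so the threshold predicate passed in here is purely hypothetical, as are the function name and the toy score vectors.

```python
import random


def label_and_split(score_vectors, is_high_score, train_fraction=0.6, seed=0):
    """Label each vector 1 (high score) or -1, shuffle, and split."""
    labelled = [(z, 1 if is_high_score(z) else -1) for z in score_vectors]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    cut = round(len(labelled) * train_fraction)
    return labelled[:cut], labelled[cut:]


# Ten hypothetical two-dimensional score ratio vectors.
Z = [[i / 10, 1 - i / 10] for i in range(10)]
# Assumed rule for illustration only: first component at least 0.5.
s_train, s_test = label_and_split(Z, lambda z: z[0] >= 0.5)
print(len(s_train), len(s_test))
```

Shuffling before cutting gives the random 60% draw described in S4; the remaining 40% is kept aside and never used for training.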

Preferably, the specific steps of constructing the SVM classifier in step 5) are as follows:

S1: Let the given training set be $\{(z_1, y_1), (z_2, y_2), \ldots, (z_n, y_n)\}$, where $z_i \in R^n$ is the input vector and $y_i \in \{-1, 1\}$ is the output. Assuming the training set can be linearly separated by a hyperplane $W \cdot Z + b = 0$, the problem becomes that of finding the optimal hyperplane:

S2: For the non-linearly-separable case, a mapping function (called a kernel function in SVM) maps the low-dimensional input space $R^n$ to a high-dimensional feature space H in which the data become linearly separable, and the optimization problem is transformed into:

S3: Solving the optimization problem of equation (5) gives:

From the minimization problem of equation (5), choosing a suitable kernel function K(·) and penalty parameter C determines the SVM classifier;

S4: Using the RBF (radial basis function) kernel, $K(Z_i, Z_j) = \exp(-\gamma \lVert Z_i - Z_j \rVert^2)$, the optimization problem of the classifier finally reduces to selecting the parameter pair (C, γ).
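The patent's own optimization equations for step 5) are not reproduced in this text. For orientation only, the standard textbook SVM formulation is given below; its notation and numbering are an assumption and may differ from the patent's equations.

```latex
% Hard-margin primal problem for a linearly separable training set
\min_{W,\,b}\ \frac{1}{2}\lVert W \rVert^{2}
\quad \text{s.t.}\quad y_i\,(W \cdot z_i + b) \ge 1,\qquad i = 1,\dots,n

% Soft-margin dual problem with kernel K and penalty parameter C
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j\, y_i y_j\, K(z_i, z_j)
\quad \text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C

% Resulting decision function
f(z) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, K(z_i, z) + b \Big)

% RBF kernel used in S4
K(z_i, z_j) = \exp\!\big( -\gamma\, \lVert z_i - z_j \rVert^{2} \big)
```

With the RBF kernel fixed, the only free choices left in this formulation are C and γ, which is why S4 reduces classifier design to selecting the pair (C, γ).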

Preferably, the specific steps of step 6) are as follows:

S1: Construct all parameter pairs (C, γ) within the ranges 1 ≤ C ≤ 1000 and 0 ≤ γ ≤ 100;

S2: Take each parameter pair (C, γ) in turn as the initial parameters of the RBF-kernel SVM classifier, train on the training data vector set $S_{train}$, test on the test data vector set $S_{test}$, and record the prediction accuracy on the test data set under this classification model as $p_t$;

S3: Targeting the test-set accuracy, use a computer program to adjust the (C, γ) parameter pair in the SVM algorithm until the accuracy $p_t$ reaches the predetermined required accuracy $p_0$, and record the model parameter pair $(C_0, \gamma_0)$ that achieves it; this parameter pair gives the model parameters of the SVM classification model sought.
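The parameter search in S1 to S3 can be sketched as a plain grid search that stops once the target accuracy is reached. SVM training is deliberately abstracted behind a hypothetical `evaluate(C, gamma)` callback that is assumed to train on S_train and return the accuracy on S_test; the toy callback and the candidate grids below are illustrative stand-ins, not values from the patent. The grid uses only positive γ, since γ = 0 makes the RBF kernel constant.

```python
import math


def grid_search(candidates_c, candidates_gamma, evaluate, target_accuracy):
    """Try (C, gamma) pairs in order; return the first pair whose accuracy
    reaches the target, or the best pair seen if none does."""
    best = None
    for c in candidates_c:
        for gamma in candidates_gamma:
            p_t = evaluate(c, gamma)
            if best is None or p_t > best[2]:
                best = (c, gamma, p_t)
            if p_t >= target_accuracy:
                return c, gamma, p_t
    return best


def toy_evaluate(c, gamma):
    """Stand-in for 'train an RBF SVM and score it on the test set'.
    Hypothetical accuracy surface peaking at C = 100, gamma = 0.1."""
    return (1.0 - 0.1 * abs(math.log10(c) - 2)
            - 0.1 * abs(math.log10(gamma) + 1))


c0, gamma0, p0_reached = grid_search(
    [1, 10, 100, 1000], [0.01, 0.1, 1, 10, 100], toy_evaluate,
    target_accuracy=0.95)
print(c0, gamma0)
```

Logarithmically spaced candidates are a common choice for C and γ because useful values often span several orders of magnitude; the patent itself only states the ranges.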

Preferably, the specific steps of step 7) are as follows:

S1: Construct an unlabeled work order score ratio data vector x';

S2: Using the model parameters $(C_0, \gamma_0)$ obtained by training in the third step, perform SVM classification prediction on x' to obtain the output y' ∈ {-1, 1};

S3: y' is the result of the consumer rights protection index model, which combines work order sentiment degree with business data.
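The prediction in S2 can be sketched with the standard kernel SVM decision function, the sign of Σ αᵢyᵢK(zᵢ, x') + b. The support vectors, dual coefficients, bias, and γ below are hypothetical values standing in for a trained model; in practice they would come from training with (C₀, γ₀).

```python
import math


def rbf_kernel(zi, zj, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(zi, zj))
    return math.exp(-gamma * squared_distance)


def svm_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Return 1 or -1; dual_coefs[i] holds alpha_i * y_i."""
    decision = sum(coef * rbf_kernel(sv, x, gamma)
                   for sv, coef in zip(support_vectors, dual_coefs)) + bias
    return 1 if decision >= 0 else -1


# Hypothetical trained model with one support vector per class.
support_vectors = [[0.9, 0.8], [0.1, 0.2]]
dual_coefs = [1.0, -1.0]
y_high = svm_predict([0.8, 0.9], support_vectors, dual_coefs,
                     bias=0.0, gamma=1.0)
y_low = svm_predict([0.2, 0.1], support_vectors, dual_coefs,
                    bias=0.0, gamma=1.0)
print(y_high, y_low)
```

A new score ratio vector is assigned to whichever class its nearby support vectors belong to, with γ controlling how quickly a support vector's influence decays with distance.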

To overcome the defects of the traditional algorithm, the invention brings the other dimensions of the consumer rights protection index system into the calculation, and ensures accurate and scientific results by applying a big data classification algorithm and an expert scoring model. The invention provides a consumer rights protection index model based on text mining technology: for the complainants and complained-about entities recorded in 12345 work order data, it accurately identifies the hot issues complained about, analyzes complaint sentiment tendency, complaint frequency, business type, and so on, and then builds the index model with a big data classification algorithm.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flow chart of classifying and organizing work order data according to the present invention;

FIG. 3 is a flow chart of model verification according to the present invention.

Detailed Description

The present invention is further illustrated in detail by the following examples.

Interpretation of terms:

Sentiment tendency degree algorithm:

Research on the sentiment tendency of text is an active branch of computer science, intelligent information processing, and natural language processing. It analyzes the sentiment expressed toward the object being commented on in a text and judges whether that sentiment is positive or negative. Text sentiment tendency analysis is widely used in information retrieval, information filtering, emotion recognition, and other fields.

Big data classification algorithm:

A big data classification algorithm mainly uses support vector machines, neural networks, and similar techniques to find the common characteristics of a group of data objects in a database and divide them into different classes according to a classification scheme. The goal is to map data items in the database to given categories through a classification model.

Support vector machine:

In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data, recognize patterns, and are used for classification and regression analysis. Given a set of training samples, each labeled as belonging to one of two classes, an SVM training algorithm builds a model that assigns new instances to one class or the other, making it a non-probabilistic binary linear classifier.

The technical scheme provided by the invention is as follows:

A consumer rights protection index analysis method based on machine-learning text mining technology comprises the following steps:

S1: Acquire the work order text of the 12345 hotline with a data acquisition module. A processor performs initial processing on the acquired work order text, passing the input text data through a word segmentation module and a denoising module to obtain the keywords in each text. The data acquisition module is connected to the processor, and the word segmentation module and the denoising module are arranged in the processor;

S2: Set up a text feature vectorization module that vectorizes and normalizes each keyword, yielding a real-valued matrix built from the vectorization-normalization result of each keyword W in file d; the vectorization-normalization result of keyword W in file d is as follows:

[Equation image omitted: vectorization-normalization result of keyword W in file d.]

S4: Set up a normalized data generation module that generates normalized data from the matched text data according to a set unified normalization scheme;

Step 2) Use the extraction module to obtain the normalized data produced by the data acquisition module, and use the data processing module to obtain the sentiment tendency degree; the specific steps are as follows:

S1: Build a dictionary of conjunctions and negation words;

S8: Improve the accuracy of the defined result: the complaint risk level rules must be verified when they are formulated. Input the validation set into the established classification learning model to obtain the complaint risk level of each text in the validation set, compare the result set with the corresponding level data in the validation set, and calculate the accuracy of the model prediction, with predicted labels ∈ {-1, 1}.

Using the embedded classification module, classify the 12345 hotline work orders acquired by the data acquisition module according to work order content and keywords to obtain secondary indicators, and group them by importance level. (The sub-items in the system are classified; according to their overall influence on the consumer rights protection index result, the items are divided into three levels: large-influence items, general-influence items, and small-influence items.)

Referring to table 1, table 1 provides a method of classification.

Table 1 is a schematic table of examples provided in 14 evaluation fields.

Step 4) Use a processor to calculate the index score ratio;

The specific steps are as follows:

S1: Suppose the consumer rights protection index has m evaluation fields and n experts take part in the evaluation. Let the rank score set given by expert k be $\{X_i(j)\}^{(k)}$, where $\{X_i(j)\}^{(k)}$ denotes the rank score that expert k (k = 1, 2, …, n) assigns to field i (i = 1, 2, …, m), with value j (j = 1, 2, …, m);

S2: Using the conversion formula, the rank score set can be converted into a cardinal score set $\{B_i(j)\}^{(k)}$, where $\{B_i(j)\}^{(k)}$ denotes the cardinal score corresponding to the field ranked j-th by expert k. The degree of importance of each field is then calculated using the following formulas:

In formulas (2) and (3), m denotes the number of fields; $S_i$ denotes the score of field i; n denotes the number of experts; $B_i(j)$ denotes the score of field i when it is ranked j-th; and $N_i$ denotes the number of experts who agree that field i is ranked j-th;

S3: Combine the m scores $S_i$ calculated in S2 into an m-dimensional vector $x = [a_1, a_2, \ldots, a_m]$, which represents the scoring of one complaint work order. Calculate the score vectors of all complaint work orders to form the full data set $Z = \{z_n, n \in R\}$;

S4: Label the data in the full data set Z: mark every high-score work order as 1 and every other work order as -1. After labeling, randomly draw 60% of the full data set to form the training data set $S_{train} = \{(z_n, y_n) \mid z_n \in Z,\ y_n \in \{-1, 1\},\ n = 1, 2, 3, \ldots\}$, where $z_n$ is the score ratio vector of the n-th work order and $y_n$ is the corresponding score ratio result; the remaining 40% forms the test data set $S_{test} = \{(z_m, y_m) \mid z_m \in Z,\ y_m \in \{-1, 1\},\ m = 1, 2, 3, \ldots\}$, where $z_m$ is the score ratio vector of the m-th work order and $y_m$ is the corresponding score ratio result.

Step 5) Construct an SVM classifier; the specific steps are as follows:

S2: For the non-linearly-separable case, a mapping function (called a kernel function in SVM) maps the low-dimensional input space $R^n$ to a high-dimensional feature space H in which the data become linearly separable, and the optimization problem is transformed into:

S3: Solving the optimization problem of equation (5) gives:

From the minimization problem of equation (5), choosing a suitable kernel function K(·) and penalty parameter C determines the SVM classifier;

S4: Using the RBF (radial basis function) kernel, $K(Z_i, Z_j) = \exp(-\gamma \lVert Z_i - Z_j \rVert^2)$, the optimization problem of the classifier finally reduces to selecting the parameter pair (C, γ).

Step 6) Construct and train the classification model; the specific steps are as follows:

S3: Targeting the test-set accuracy, use a computer program to adjust the (C, γ) parameter pair in the SVM algorithm until the accuracy $p_t$ reaches the predetermined required accuracy $p_0$, and record the model parameter pair $(C_0, \gamma_0)$ that achieves it; this parameter pair gives the model parameters of the SVM classification model sought.

Step 7) Perform model prediction using the model from step 6); the specific steps are as follows:

S2: Using the model parameters $(C_0, \gamma_0)$ obtained by training in the third step, perform SVM classification prediction on x' to obtain the output y' ∈ {-1, 1}.

In the traditional approach, calculation of the consumer rights protection index depends heavily on the monetary amounts involved in rights protection cases; the data has a single dimension and lacks practical generality. The invention uses existing computer processing speed to process thousands of 12345 work orders, classifies and analyzes the work order content, and combines a recent sentiment tendency analysis method with business data. It thereby achieves fast, scientific, standardized, and objective results, provides consumer rights protection index calculation for law enforcement departments, and gives management departments decision support for enforcement.

Claims

1. A consumer rights protection index analysis method based on machine-learning text mining technology, characterized by comprising the following steps:

S2: Set up a text feature vectorization module that vectorizes and normalizes each keyword, yielding a real-valued matrix built from the vectorization-normalization result of each keyword W in file d;

Step 2) Use the extraction module to obtain the normalized data produced by the data acquisition module, and use the data processing module to obtain the sentiment tendency degree;

Step 4) Use a processor to calculate the index score ratio;

Step 5) Construct an SVM classifier;

Step 6) Construct and train the classification model;

Step 7) Perform model prediction using the model from step 6).

2. The consumer rights protection index analysis method based on machine-learning text mining technology of claim 1, wherein

the specific steps of obtaining the sentiment tendency degree with the data processing module in step 2) are as follows:

S1: Build a dictionary of conjunctions and negation words;

S3: Match against an existing sentiment dictionary to obtain the polarity and sentiment score of each word;

S5: Using the polarity and sentiment score of each word, adjusted for the conjunctions and negation words, accumulate over the normalized data to obtain the sentiment calculation score;

S6: Repeat steps S2 to S5 and determine the sentiment polarity of each text from the sign of its sentiment calculation score;

3. The consumer rights protection index analysis method based on machine-learning text mining technology of claim 1, wherein

the specific steps of calculating the index score ratio with the processor in step 4) are as follows:

S1: Suppose the consumer rights protection index has m evaluation fields and n experts take part in the evaluation. Let the rank score set given by expert k be $\{X_i(j)\}^{(k)}$, where $\{X_i(j)\}^{(k)}$ denotes the rank score that expert k (k = 1, 2, …, n) assigns to field i (i = 1, 2, …, m), with value j (j = 1, 2, …, m);

S2: Using the conversion formula, the rank score set can be converted into a cardinal score set $\{B_i(j)\}^{(k)}$, where $\{B_i(j)\}^{(k)}$ denotes the cardinal score corresponding to the field ranked j-th by expert k. The degree of importance of each field is then calculated using the following formulas:

In formulas (2) and (3), m denotes the number of fields; $S_i$ denotes the score of field i; n denotes the number of experts; $B_i(j)$ denotes the score of field i when it is ranked j-th; and $N_i$ denotes the number of experts who agree that field i is ranked j-th;

S3: Combine the m scores $S_i$ calculated in S2 into an m-dimensional vector $x = [a_1, a_2, \ldots, a_m]$, which represents the scoring of one complaint work order, and calculate the score vectors of all complaint work orders to form the full data set $Z = \{z_n, n \in R\}$;

S4: Label the data in the full data set Z: mark every high-score work order as 1 and every other work order as -1. After labeling, randomly draw 60% of the full data set to form the training data set $S_{train} = \{(z_n, y_n) \mid z_n \in Z,\ y_n \in \{-1, 1\},\ n = 1, 2, 3, \ldots\}$, where $z_n$ is the score ratio vector of the n-th work order and $y_n$ is the corresponding score ratio result; the remaining 40% forms the test data set $S_{test} = \{(z_m, y_m) \mid z_m \in Z,\ y_m \in \{-1, 1\},\ m = 1, 2, 3, \ldots\}$, where $z_m$ is the score ratio vector of the m-th work order and $y_m$ is the corresponding score ratio result.

4. The consumer rights protection index analysis method based on machine-learning text mining technology of claim 1, wherein

the specific steps of constructing the SVM classifier in step 5) are as follows:

S3: Solving the optimization problem of equation (5) gives:

S4: Using the RBF (radial basis function) kernel, $K(Z_i, Z_j) = \exp(-\gamma \lVert Z_i - Z_j \rVert^2)$, the optimization problem of the classifier finally reduces to selecting the parameter pair (C, γ).

5. The consumer rights protection index analysis method based on machine-learning text mining technology of claim 1, wherein

the specific steps of step 6) are as follows:

6. The consumer rights protection index analysis method based on machine-learning text mining technology of claim 1, wherein

the specific steps of step 7) are as follows:

S2: Using the model parameters $(C_0, \gamma_0)$ obtained by training in the third step, perform SVM classification prediction on x' to obtain the output y' ∈ {-1, 1};