CN107992613A - Machine-learning-based text mining method for analyzing a consumer rights protection index - Google Patents

Machine-learning-based text mining method for analyzing a consumer rights protection index

Info

Publication number
CN107992613A
Authority
CN
China
Prior art keywords
data
score
module
text
complaint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711366440.9A
Other languages
Chinese (zh)
Inventor
王晓佳
蔡文鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Industry Kaiyuan Science And Technology Co Ltd
Original Assignee
Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Industry Kaiyuan Science And Technology Co Ltd filed Critical Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority to CN201711366440.9A priority Critical patent/CN107992613A/en
Publication of CN107992613A publication Critical patent/CN107992613A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/01 Customer relationship services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Strategic Management (AREA)
  • Computational Linguistics (AREA)
  • Marketing (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Finance (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Databases & Information Systems (AREA)
  • Educational Administration (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a machine-learning-based text mining method for analyzing a consumer rights protection index, characterized in that it comprises the following steps: step 1) acquiring and processing data to generate normalized data; step 2) using an extraction module to obtain the normalized data produced by the data acquisition module, and using a data processing module to obtain the degree of sentiment tendency; step 3) using a classification module to group the consumer rights protection data and classify it by weight after the sentiment tendency analysis; step 4) using a processor to calculate the index score rate; step 5) constructing an SVM classifier; step 6) constructing and training the classification model; step 7) performing model prediction with the model from step 6). By combining the latest sentiment tendency analysis method with business data, the method produces fast, scientific, standardized and objective results, supplying law enforcement agencies with a consumer rights protection index and administrative departments with decision support for enforcement.

Description

Machine-learning-based text mining method for analyzing a consumer rights protection index
Technical Field
The invention relates to techniques for calculating a consumer rights protection index, and in particular to a machine-learning-based text mining method for analyzing such an index. It is mainly applicable to government stability-maintenance monitoring systems, complaint and report data analysis systems, 12345 government service hotline systems, public security systems, industry and commerce bureau systems, and enterprise credit systems.
Background
With rapid socioeconomic development in recent years, the 12345 government service hotline handles a share of consumer inquiries, complaints, reports, opinions and suggestions. These amount to "investigation and research" delivered by the public, and they are big data that the industry and commerce administration must consult when studying how to strengthen and improve market supervision. Complaints account for more than half of all 12345 work orders. Processing and analyzing the text of complaint work orders and building an enterprise complaint risk early-warning model is an effective exploration of, and an important route to, turning market "big data" into market supervision capacity.
The 12345 government service hotline center of the industry and commerce administration deals directly with the consuming public, is close to both people and markets, and generates a large amount of data every day. This data is real-time feedback from consumers on current market problems. It can promptly and accurately reflect the changing characteristics and patterns of consumption hot spots, as well as the integrity of market entities' operations, the quality of goods and services, and the fairness of market transactions. It is a thermometer and barometer of market supervision quality and, in effect, the chief examiner of how well the industry and commerce administration supervises the market.
It is therefore of great significance to strengthen the analysis and use of 12345 data, especially the risk-level classification of enterprises that are complained about, and to let the data speak, so that the enforcement work of the relevant departments becomes more targeted and timely.
The traditional method of calculating the consumer rights protection index is as follows. A given period is taken as the base period, and eight key categories of goods and services (such as clothing, shoes and hats; hairdressing and beauty; household goods; household appliances; computer products; communication equipment; vehicles; agricultural production materials; and online transactions) are selected as monitored fields. The total amount of money involved in consumer complaints and reports about these goods and services is calculated, and a mathematical model is then established: consumer rights protection index = (sum of the amounts involved for the eight categories of goods or services in the current month) ÷ (total amount involved for the same categories of goods or services in the base period) × 100.
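The traditional formula can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout and all figures are made up for the example, and only three categories are shown.

```python
def traditional_index(current_amounts, base_amounts):
    """Traditional index: current-month amounts over base-period amounts, times 100.

    Both arguments map a category name to the total amount of money involved
    in complaint cases for that category (hypothetical data layout).
    """
    base_total = sum(base_amounts.values())
    if base_total == 0:
        raise ValueError("base-period total must be non-zero")
    return sum(current_amounts.values()) / base_total * 100


# Made-up figures for three of the monitored categories.
base_period = {"clothing": 50_000, "household appliances": 30_000, "vehicles": 20_000}
current_month = {"clothing": 40_000, "household appliances": 45_000, "vehicles": 35_000}

index = traditional_index(current_month, base_period)
print(f"index = {index:.1f}")

# One additional high-value vehicle case is enough to shift the index sharply.
with_outlier = dict(current_month, vehicles=current_month["vehicles"] + 200_000)
index_with_outlier = traditional_index(with_outlier, base_period)
print(f"index with one large case = {index_with_outlier:.1f}")
```

The second call shows why a single amount-based indicator is fragile: one large case moves the whole index.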
This algorithm has two disadvantages. First, the amount involved is the only indicator considered, so the result is often skewed by cases involving large amounts, and other factors are ignored, such as the case-filing rate, the number of people involved, the proportion of complainants, and the proportion of complaints against abnormal enterprises. Second, because the volume of data is huge, adding more data dimensions makes manual calculation difficult.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a machine-learning-based text mining method for analyzing a consumer rights protection index.
The specific scheme is as follows:
A machine-learning-based text mining method for analyzing a consumer rights protection index comprises the following steps:
Step 1) Acquire and process data to generate normalized data:
S1: Use a data acquisition module to acquire the work order text of the 12345 hotline. Use a processor to perform initial processing on the acquired work order text, passing the input text data through a word segmentation module and a denoising module to obtain the keywords in each text. The data acquisition module is connected to the processor, and the word segmentation module and the denoising module are provided in the processor.
S2: Provide a text feature vectorization module that vectorizes and normalizes each keyword, yielding a real-valued matrix built from the vectorized, normalized result for each keyword W in a document d.
S3: Provide a text similarity matching module that, from the real-valued matrix, computes the cosine similarity between keywords and matches the text data with the closest cosine distance to form a lexicon of similar words.
S4: Provide a normalized data generation module that converts the matched text data into normalized data according to a set, unified normalization scheme.
Step 2) Use an extraction module to obtain the normalized data produced by the data acquisition module, and use a data processing module to obtain the degree of sentiment tendency.
Step 3) Use a classification module to group the consumer rights protection data and classify it by weight after the sentiment tendency analysis.
Step 4) Use a processor to calculate the index score rate.
Step 5) Construct an SVM classifier.
Step 6) Construct and train the classification model.
Step 7) Perform model prediction with the model from step 6).
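Sub-steps S2 and S3 of step 1) can be illustrated with a minimal sketch. The patent's own vectorization formula is not reproduced in this text, so the smoothed TF-IDF weighting below is an assumption, as are the function names and the toy work orders. Word segmentation and denoising are assumed to have been done already.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return {keyword: [weight in doc 0, weight in doc 1, ...]}.

    docs is a list of non-empty token lists. The smoothed TF-IDF weighting
    is an assumption made for this sketch.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    matrix = {}
    for word, df in doc_freq.items():
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        matrix[word] = [doc.count(word) / len(doc) * idf for doc in docs]
    return matrix


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, matrix):
    """Keyword whose document vector is closest (by cosine) to that of `word`."""
    others = (w for w in matrix if w != word)
    return max(others, key=lambda w: cosine(matrix[word], matrix[w]))


# Three toy, already-segmented work orders.
docs = [
    ["refund", "delay", "phone"],
    ["refund", "delay", "shoes"],
    ["noise", "construction"],
]
matrix = tfidf_matrix(docs)
print(most_similar("refund", matrix))
print(cosine(matrix["refund"], matrix["noise"]))
```

Keywords that occur in the same work orders end up with nearly parallel vectors, which is what allows them to be grouped into a lexicon of similar words.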
Preferably, the specific steps of step 2) for obtaining the degree of sentiment tendency using the data processing module are as follows:
S1: Build a dictionary of conjunctions and negation words.
S2: Using this dictionary, extract the conjunctions and negation words from the normalized data and mark their positions in the normalized data.
S3: Match against an existing sentiment dictionary to obtain the polarity of each word and its sentiment score.
S4: Determine the relative weight of the preceding and following clauses from the position of each conjunction, and determine double negation and polarity inversion of adjacent words from the position of each negation word.
S5: Using the polarity and sentiment score of each word, and after applying the effects of conjunctions and negation words, accumulate over the normalized data to obtain the sentiment score.
S6: Repeat steps S2 to S5. If the sentiment score is positive, the sentiment is classified as positive; if the sentiment score is negative, the sentiment is classified as negative.
S7: Complaint risk level rule: according to the clustering result, complaint risk is divided into five levels, including high risk, general complaint, complaint tendency and no complaint tendency.
S8: Improve the accuracy of the defined result: the complaint risk level rules must be verified when they are formulated. The validation set is fed into the established classification model to obtain the complaint risk level of each text in the validation set, the results are compared with the corresponding level data in the validation set, and the accuracy of the model's predictions is calculated, with outputs in {-1, 1}.
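A minimal sketch of sub-steps S2 to S6 follows. The dictionaries, the clause weights attached to the conjunction, and the rule that negation applies to the next sentiment word in the same clause are all hypothetical choices for illustration; the patent does not give concrete lexicons or weights.

```python
# All dictionaries below are tiny hypothetical stand-ins for real lexicons.
SENTIMENT = {"good": 1.0, "satisfied": 1.0, "bad": -1.0, "refused": -1.5}
NEGATORS = {"not", "never"}
# Adversative conjunction -> (weight of clause before, weight of clause after).
ADVERSATIVES = {"but": (0.5, 1.5)}


def sentiment_score(tokens):
    """Accumulate weighted lexicon scores for one segmented sentence."""
    # Position of the first adversative conjunction, if any.
    split = next((i for i, t in enumerate(tokens) if t in ADVERSATIVES), None)
    score = 0.0
    pending_negations = 0
    for i, token in enumerate(tokens):
        if token in NEGATORS:
            pending_negations += 1
        elif token in ADVERSATIVES:
            pending_negations = 0  # negation does not carry across clauses
        elif token in SENTIMENT:
            value = SENTIMENT[token]
            if pending_negations % 2 == 1:  # odd count flips; double negation cancels
                value = -value
            pending_negations = 0
            if split is None:
                weight = 1.0
            else:
                before, after = ADVERSATIVES[tokens[split]]
                weight = before if i < split else after
            score += weight * value
    return score


def polarity(score):
    """Map a numeric score to a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


mixed = sentiment_score(["service", "good", "but", "refund", "refused"])
single_negation = sentiment_score(["not", "bad"])
double_negation = sentiment_score(["not", "never", "bad"])
print(mixed, polarity(mixed))
print(single_negation, double_negation)
```

In the first sentence the clause after "but" carries more weight, so the complaint outweighs the praise and the overall polarity is negative.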
Preferably, the specific steps of step 4), calculating the index score rate using the processor, are as follows:
S1: Suppose the consumer rights protection index has m evaluation domains and n experts take part in the evaluation. Let the set of scores given by expert k be $\{X_{i(j)}\}^{(k)}$, where $\{X_{i(j)}\}^{(k)}$ denotes the rank score that expert k (k = 1, 2, …, n) gives to domain i (i = 1, 2, …, m), with value j (j = 1, 2, …, m).
S2: According to [formula image not reproduced], the rank score set can be converted into a cardinal score set $\{B_{i(j)}\}^{(k)}$, where $\{B_{i(j)}\}^{(k)}$ denotes the cardinal score corresponding to the domain that expert k ranks in position j.
The importance of each domain is then calculated using the following formulas: [formula images (2) and (3) not reproduced]
In formulas (2) and (3), m is the number of domains; $S_i$ is the score of domain i; n is the number of experts; $B_{i(j)}$ is the score of domain i when it is ranked in position j; and $N_i$ is the number of experts who agree that domain i ranks in position j.
S3: Combine the m scores $S_i$ calculated in S2 into an m-dimensional vector $x = [a_1, a_2, \ldots, a_m]$, which represents the scoring of one complaint work order. Computing this vector for every complaint work order yields the full data set $Z = \{z_n, n \in R\}$.
S4: Label the data in the full data set Z: every high-score work order is labeled 1 and every other work order is labeled -1. After labeling, 60% of the data, drawn at random from the full data set, forms the training set $S_{train} = \{(z_n, y_n) \mid z_n \in Z, y_n \in \{-1, 1\}, n = 1, 2, 3, \ldots\}$, where $z_n$ is the score rate vector of the n-th work order and $y_n$ is the corresponding label. The remaining 40% forms the test set $S_{test} = \{(z_m, y_m) \mid z_m \in Z, y_m \in \{-1, 1\}, m = 1, 2, 3, \ldots\}$, where $z_m$ is the score rate vector of the m-th work order and $y_m$ is the corresponding label.
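The labeling and 60/40 random split of S4 can be sketched as follows. The source does not say what makes a work order "high-score", so the threshold rule `mean_at_least_06` is purely hypothetical, as are the function names, the fixed seed and the toy vectors.

```python
import random


def label_and_split(score_vectors, is_high_score, train_fraction=0.6, seed=0):
    """Label each vector 1 or -1, shuffle, and split into (train, test)."""
    labelled = [(z, 1 if is_high_score(z) else -1) for z in score_vectors]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    cut = round(len(labelled) * train_fraction)
    return labelled[:cut], labelled[cut:]


def mean_at_least_06(z):
    """Hypothetical high-score rule: mean component of the vector is >= 0.6."""
    return sum(z) / len(z) >= 0.6


# Ten made-up 3-dimensional score rate vectors.
Z = [[i / 10, (10 - i) / 20, 0.5 + i / 20] for i in range(10)]
S_train, S_test = label_and_split(Z, mean_at_least_06)
print(len(S_train), len(S_test))
```

Shuffling before cutting is what makes the 60% a random draw rather than simply the first six work orders.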
Preferably, the specific steps of constructing the SVM classifier in step 5) are as follows:
S1: Let the given training set be $\{(z_1, y_1), (z_2, y_2), \ldots, (z_n, y_n)\}$, where $z_i \in R^n$ is the input vector and $y_i \in \{-1, 1\}$ is the output. Assuming the training set can be linearly separated by a hyperplane $W \cdot Z + b = 0$, the problem becomes one of finding the optimal hyperplane: [formula image not reproduced]
S2: For the non-linearly separable case, a mapping function (called a kernel function in SVM) can map the low-dimensional input space $R^n$ into a high-dimensional feature space H in which the data become linearly separable. The optimization problem then becomes: [formula image not reproduced]
S3: Solving the optimization function of equation (5) gives: [formula image not reproduced]
From the minimization problem of equation (5), choosing a suitable kernel function K(·) and parameter C determines the SVM classifier.
S4: Using the RBF (radial basis function) kernel, $K(Z_i, Z_j) = \exp(-\gamma \|Z_i - Z_j\|^2)$, the optimization problem of the classifier finally reduces to selecting the parameter pair (C, γ).
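As a small illustration of the RBF kernel in S4, the sketch below evaluates $K(Z_i, Z_j) = \exp(-\gamma \|Z_i - Z_j\|^2)$ for a few toy vectors and assembles the kernel (Gram) matrix that an SVM solver would work with. The function names and sample vectors are illustrative only.

```python
import math


def rbf_kernel(z_i, z_j, gamma):
    """K(z_i, z_j) = exp(-gamma * ||z_i - z_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return math.exp(-gamma * squared_distance)


def gram_matrix(vectors, gamma):
    """Kernel value for every pair of vectors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]


vectors = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
K = gram_matrix(vectors, gamma=0.5)
for row in K:
    print([round(value, 4) for value in row])
```

A larger γ makes the kernel value fall off faster with distance, which is why γ has to be tuned together with C.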
Preferably, the specific steps of step 6) are as follows:
S1: Taking 1 ≤ C ≤ 1000 and 0 ≤ γ ≤ 100 as the ranges, construct the parameter pairs formed by all values of C and γ within these ranges.
S2: Take each parameter pair (C, γ) in turn as the initial parameters of the RBF-kernel SVM classifier, train on the training vector set $S_{train}$, test on the test vector set $S_{test}$, and record the prediction accuracy on the test set under this classification model as $p_t$.
S3: With the test-set accuracy as the target, use a computer program to adjust the (C, γ) parameter pair in the SVM algorithm until the accuracy $p_t$ reaches the predetermined required accuracy $p_0$, and record the parameter pair $(C_0, \gamma_0)$ that achieves it. This pair gives the model parameters of the SVM classification model sought.
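The search in S1 to S3 amounts to a grid search that stops once the target accuracy is reached. The sketch below shows only the search loop: `toy_fit_and_score` is a hypothetical stand-in for actually training an RBF-kernel SVM on $S_{train}$ and measuring accuracy on $S_{test}$, and the log-spaced grids are an assumption. The grid starts above γ = 0 because with γ = 0 every RBF kernel value equals exp(0) = 1, so the kernel cannot distinguish any two inputs.

```python
import math


def grid_search(c_values, gamma_values, fit_and_score, target_accuracy):
    """Try (C, gamma) pairs in order; stop at the first pair reaching the target.

    fit_and_score(C, gamma) must train on the training set and return the
    accuracy on the test set. Returns ((accuracy, C, gamma), reached_target).
    """
    best = None
    for c in c_values:
        for gamma in gamma_values:
            p_t = fit_and_score(c, gamma)
            if best is None or p_t > best[0]:
                best = (p_t, c, gamma)
            if p_t >= target_accuracy:
                return best, True
    return best, False


def toy_fit_and_score(c, gamma):
    """Hypothetical stand-in for real training: accuracy peaks at C=100, gamma=0.1."""
    return 1.0 - 0.1 * abs(math.log10(c) - 2) - 0.1 * abs(math.log10(gamma) + 1)


c_grid = [1, 10, 100, 1000]
gamma_grid = [0.01, 0.1, 1, 10, 100]
(p_t, c_0, gamma_0), reached = grid_search(c_grid, gamma_grid, toy_fit_and_score, 0.95)
print(c_0, gamma_0, reached)
```

If no pair reaches the target, the function still returns the best pair seen, with the flag set to `False`, so the caller can decide whether to widen the grid.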
Preferably, the specific steps of step 7) are as follows:
S1: Construct a work order score rate vector x' that has no label.
S2: Using the model parameters $(C_0, \gamma_0)$ obtained by training in the third step, perform SVM classification prediction on x' to obtain the output y' ∈ {-1, 1}.
S3: y' is the result of the consumer rights protection index model, which combines work order sentiment degree with business data.
To overcome the defects of the traditional algorithm, the invention brings the other dimensions of the consumer rights protection index system into the calculation, and uses a big data classification algorithm and an expert scoring model to make the result accurate and scientific. Working from the complainants and complained-about entities recorded in 12345 work order data, the text-mining-based model captures the hot-spot issues being complained about, analyzes complaint sentiment tendency, complaint frequency, business type and so on, and then builds the consumer rights protection index model using a big data classification algorithm.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flowchart of classifying and organizing work order data in the present invention;
FIG. 3 is a flow chart of model verification according to the present invention.
Detailed Description
The present invention is described in further detail through the following embodiments.
Explanation of terms:
Sentiment tendency algorithm:
Research on the sentiment tendency of text is an active branch of computer science, intelligent information processing and natural language processing. By analyzing a text, it determines whether the sentiment expressed toward the subject of the comment is positive or negative. Text sentiment tendency analysis is widely applied in information retrieval, information filtering, emotion recognition and other fields.
Big data classification algorithm:
A big data classification algorithm mainly uses support vector machines, neural networks and similar techniques to find the common characteristics of a group of data objects in a database and divide them into different classes according to a classification scheme. Its goal is to map the data items in a database to given categories through a classification model.
Support vector machine:
In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, and are used for classification and regression analysis. Given a set of training samples, each labeled as belonging to one of two classes, an SVM training algorithm builds a model that assigns new instances to one class or the other, making it a non-probabilistic binary linear classifier.
The technical scheme provided by the invention is as follows:
A machine-learning-based text mining method for analyzing a consumer rights protection index comprises the following steps:
Step 1) Acquire and process data to generate normalized data:
S1: Use a data acquisition module to acquire the work order text of the 12345 hotline. Use a processor to perform initial processing on the acquired work order text, passing the input text data through a word segmentation module and a denoising module to obtain the keywords in each text. The system comprises a data acquisition module, a processor, a word segmentation module and a denoising module, with the data acquisition module connected to the processor.
S2: Provide a text feature vectorization module that vectorizes and normalizes each keyword, yielding a real-valued matrix built from the vectorized, normalized result for each keyword W in a document d. The vectorized, normalized result for keyword W in document d is as follows:
[Formula image: vectorized, normalized result of keyword W in document d]
S3: Provide a text similarity matching module that, from the real-valued matrix, computes the cosine similarity between keywords and matches the text data with the closest cosine distance to form a lexicon of similar words.
S4: Provide a normalized data generation module that converts the matched text data into normalized data according to a set, unified normalization scheme.
Step 2) Use an extraction module to obtain the normalized data produced by the data acquisition module, and use a data processing module to obtain the degree of sentiment tendency. The specific steps are as follows:
S1: Build a dictionary of conjunctions and negation words.
S2: Using this dictionary, extract the conjunctions and negation words from the normalized data and mark their positions in the normalized data.
S3: Match against an existing sentiment dictionary to obtain the polarity of each word and its sentiment score.
S4: Determine the relative weight of the preceding and following clauses from the position of each conjunction, and determine double negation and polarity inversion of adjacent words from the position of each negation word.
S5: Using the polarity and sentiment score of each word, and after applying the effects of conjunctions and negation words, accumulate over the normalized data to obtain the sentiment score.
S6: Repeat steps S2 to S5. If the sentiment score is positive, the sentiment is classified as positive; if the sentiment score is negative, the sentiment is classified as negative.
S7: Complaint risk level rule: according to the clustering result, complaint risk is divided into five levels, including high risk, general complaint, complaint tendency and no complaint tendency.
S8: Improve the accuracy of the defined result: the complaint risk level rules must be verified when they are formulated. The validation set is fed into the established classification model to obtain the complaint risk level of each text in the validation set, the results are compared with the corresponding level data in the validation set, and the accuracy of the model's predictions is calculated, with outputs in {-1, 1}.
Step 3) Use a classification module to group the consumer rights protection data and classify it by weight after the sentiment tendency analysis.
The embedded classification module classifies the 12345 hotline work orders acquired by the data acquisition module according to work order content and keywords, producing second-level indicators, and groups them by level of importance. (The different sub-items within a system are classified according to their overall influence on the consumer rights protection index, in three levels: high-influence items, general-influence items and low-influence items.)
Referring to Table 1, Table 1 provides one method of classification.
Table 1 is an illustrative table of examples covering 14 evaluation fields.
Step 4) Use a processor to calculate the index score rate.
The specific steps are as follows:
S1: Suppose the consumer rights protection index has m evaluation domains and n experts take part in the evaluation. Let the set of scores given by expert k be $\{X_{i(j)}\}^{(k)}$, where $\{X_{i(j)}\}^{(k)}$ denotes the rank score that expert k (k = 1, 2, …, n) gives to domain i (i = 1, 2, …, m), with value j (j = 1, 2, …, m).
S2: According to [formula image not reproduced], the rank score set can be converted into a cardinal score set $\{B_{i(j)}\}^{(k)}$, where $\{B_{i(j)}\}^{(k)}$ denotes the cardinal score corresponding to the domain that expert k ranks in position j. The importance of each domain is then calculated using the following formulas: [formula images (2) and (3) not reproduced]
In formulas (2) and (3), m is the number of domains; $S_i$ is the score of domain i; n is the number of experts; $B_{i(j)}$ is the score of domain i when it is ranked in position j; and $N_i$ is the number of experts who agree that domain i ranks in position j.
S3: Combine the m scores $S_i$ calculated in S2 into an m-dimensional vector $x = [a_1, a_2, \ldots, a_m]$, which represents the scoring of one complaint work order. Computing this vector for every complaint work order yields the full data set $Z = \{z_n, n \in R\}$.
S4: Label the data in the full data set Z: every high-score work order is labeled 1 and every other work order is labeled -1. After labeling, 60% of the data, drawn at random from the full data set, forms the training set $S_{train} = \{(z_n, y_n) \mid z_n \in Z, y_n \in \{-1, 1\}, n = 1, 2, 3, \ldots\}$, where $z_n$ is the score rate vector of the n-th work order and $y_n$ is the corresponding label. The remaining 40% forms the test set $S_{test} = \{(z_m, y_m) \mid z_m \in Z, y_m \in \{-1, 1\}, m = 1, 2, 3, \ldots\}$, where $z_m$ is the score rate vector of the m-th work order and $y_m$ is the corresponding label.
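The expert scoring in S1 and S2 can be illustrated with a sketch, but only under explicit assumptions, because the conversion and importance formulas are not reproduced in this text. The sketch assumes a Borda-style conversion in which rank j among m domains is worth the cardinal score m - j + 1, takes a domain's score as the sum of its cardinal scores over all experts, and normalizes the scores to sum to 1. None of these choices is confirmed by the source, and all names and rankings are made up.

```python
def domain_scores(rankings):
    """rankings[k][i] is the rank (1 = most important) expert k gives domain i.

    Assumption: rank j among m domains is worth the cardinal score m - j + 1,
    and a domain's score is the sum of its cardinal scores over all experts.
    """
    m = len(rankings[0])
    scores = [0] * m
    for expert_ranks in rankings:
        for i, j in enumerate(expert_ranks):
            scores[i] += m - j + 1
    return scores


def score_rates(scores):
    """Normalize domain scores so that they sum to 1 (assumed normalization)."""
    total = sum(scores)
    return [s / total for s in scores]


# Three experts each rank three domains (made-up rankings).
rankings = [
    [1, 2, 3],
    [1, 3, 2],
    [2, 1, 3],
]
S = domain_scores(rankings)
rates = score_rates(S)
print(S)
print([round(r, 3) for r in rates])
```

Summing cardinal scores over experts is equivalent to multiplying each position's score by the number of experts who chose that position, which matches the roles the text gives to the position score and the expert count.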
Step 5) Construct an SVM classifier. The specific steps are as follows:
S1: Let the given training set be $\{(z_1, y_1), (z_2, y_2), \ldots, (z_n, y_n)\}$, where $z_i \in R^n$ is the input vector and $y_i \in \{-1, 1\}$ is the output. Assuming the training set can be linearly separated by a hyperplane $W \cdot Z + b = 0$, the problem becomes one of finding the optimal hyperplane: [formula image not reproduced]
S2: For the non-linearly separable case, a mapping function (called a kernel function in SVM) can map the low-dimensional input space $R^n$ into a high-dimensional feature space H in which the data become linearly separable. The optimization problem then becomes: [formula image not reproduced]
S3: Solving the optimization function of equation (5) gives: [formula image not reproduced]
From the minimization problem of equation (5), choosing a suitable kernel function K(·) and parameter C determines the SVM classifier.
S4: Using the RBF (radial basis function) kernel, $K(Z_i, Z_j) = \exp(-\gamma \|Z_i - Z_j\|^2)$, the optimization problem of the classifier finally reduces to selecting the parameter pair (C, γ).
Step 6) Construct and train the classification model. The specific steps are as follows:
S1: Taking 1 ≤ C ≤ 1000 and 0 ≤ γ ≤ 100 as the ranges, construct the parameter pairs formed by all values of C and γ within these ranges.
S2: Take each parameter pair (C, γ) in turn as the initial parameters of the RBF-kernel SVM classifier, train on the training vector set $S_{train}$, test on the test vector set $S_{test}$, and record the prediction accuracy on the test set under this classification model as $p_t$.
S3: With the test-set accuracy as the target, use a computer program to adjust the (C, γ) parameter pair in the SVM algorithm until the accuracy $p_t$ reaches the predetermined required accuracy $p_0$, and record the parameter pair $(C_0, \gamma_0)$ that achieves it. This pair gives the model parameters of the SVM classification model sought.
Step 7) Perform model prediction with the model from step 6). The specific steps are as follows:
S1: Construct a work order score rate vector x' that has no label.
S2: Using the model parameters $(C_0, \gamma_0)$ obtained by training in the third step, perform SVM classification prediction on x' to obtain the output y' ∈ {-1, 1}.
S3: y' is the result of the consumer rights protection index model, which combines work order sentiment degree with business data.
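Prediction with a trained RBF-kernel SVM uses the standard kernel decision function, y' = sign(Σ α_i y_i K(z_i, x') + b), summed over the support vectors. The sketch below shows that final step only. The support vectors, coefficients, bias and γ are made-up values standing in for the output of real training, and the function names are illustrative.

```python
import math


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Return 1 or -1 from the kernel SVM decision function.

    dual_coefs[i] is alpha_i * y_i for support vector i.
    """
    decision = bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
    return 1 if decision >= 0 else -1


# Hypothetical trained model: one support vector per class.
support_vectors = [[0.9, 0.8], [0.1, 0.2]]
dual_coefs = [1.0, -1.0]
bias = 0.0
gamma_0 = 1.0

y_high = svm_predict([0.8, 0.9], support_vectors, dual_coefs, bias, gamma_0)
y_low = svm_predict([0.2, 0.1], support_vectors, dual_coefs, bias, gamma_0)
print(y_high, y_low)
```

An unlabeled vector takes the label of whichever class's support vectors it lies closer to in kernel terms, weighted by their coefficients.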
In the traditional approach, calculating the consumer rights protection index depends heavily on the monetary amounts involved; the data has a single dimension and lacks practical generality. Using the processing speed of existing computers, the invention can process thousands upon thousands of 12345 work orders, classify and analyze their content, and combine the latest sentiment tendency analysis method with business data calculation. It thereby produces fast, scientific, standardized and objective results, supplying law enforcement departments with the consumer rights protection index and administrative departments with decision support for enforcement.

Claims (6)

1. A machine-learning-based text mining method for consumption right-maintaining index analysis, characterized by comprising the following steps:
step 1) acquiring and processing data to generate normalized data:
S1: acquiring the work order text of the 12345 hotline with a data acquisition module; performing initial processing on the acquired work order text with a processor, in which a word segmentation module and a denoising module process the input text data to obtain the keywords in each text; the system comprises the data acquisition module, the processor, the word segmentation module and the denoising module, the data acquisition module being connected to the processor;
S2: providing a text feature vectorization module that vectorizes and normalizes each keyword, yielding a real-valued matrix built from the vectorized and normalized result of each keyword w in document d;
S3: providing a text data similarity matching module that computes the cosine similarity between keywords from the established real-valued matrix and matches the text data with the closest cosine distance to form a lexicon of similar words;
S4: providing a normalized data generation module that generates normalized data from the matched text data according to a set unified normalization scheme;
step 2) obtaining, with an extraction module, the normalized data produced by the data acquisition module, and obtaining the emotional tendency degree with a data processing module;
step 3) grouping and weight-classifying the consumption right data after the emotional tendency analysis with a classification module;
step 4) calculating the index score ratio with the processor;
step 5) constructing an SVM classifier;
step 6) constructing and training a classification model;
step 7) performing model prediction with the model obtained in step 6).
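The similarity matching in S3 of step 1 can be illustrated with a short sketch. It assumes each keyword has already been turned into a real-valued vector. The keywords, the vector values and the helper names `cosine_similarity` and `most_similar` are hypothetical and serve only to show how the closest keyword is picked by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(keyword, vectors):
    """Return the other keyword whose vector has the highest cosine similarity."""
    others = (k for k in vectors if k != keyword)
    return max(others, key=lambda k: cosine_similarity(vectors[keyword], vectors[k]))


# Hypothetical keyword vectors (rows of the real-valued matrix built in S2).
KEYWORD_VECTORS = {
    "refund": [1.0, 0.0, 1.0],
    "return": [1.0, 0.1, 0.9],
    "delivery": [0.0, 1.0, 0.0],
}

print(most_similar("refund", KEYWORD_VECTORS))  # return
```

Grouping each keyword with its nearest neighbours in this way is one simple route to the lexicon of similar words that the claim describes.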
2. The machine-learning-based text mining method for consumption right-maintaining index analysis of claim 1, wherein in step 2) the specific steps of obtaining the emotional tendency degree with the data processing module are as follows:
S1: establishing a dictionary of conjunctions and negation words;
S2: extracting conjunctions and negation words from the normalized data according to that dictionary, and marking their positions in the normalized data;
S3: matching against an existing emotion dictionary to obtain the polarity and emotion score of each word;
S4: determining the weights of the preceding and following clauses from the position of each conjunction, and determining double negation and polarity inversion of adjacent words from the position of each negation word;
S5: applying the conjunction and negation adjustments to the polarity and emotion score of each word, and accumulating over the normalized data to obtain the emotion calculation score;
S6: repeating steps S2 to S5; the sentiment is judged positive, negative or neutral from the emotion calculation score;
S7: formulating the complaint risk grade rule: according to the clustering result, the complaint risk is divided into five risk grades, namely a high risk grade, a general complaint grade, a complaint tendency grade and a non-complaint tendency grade;
s8: improving the accuracy of the definition result: the rules need to be verified when complaint risk level rules are formulated, the data of a verification set is input into an established classification learning model, the complaint risk level of each text data in the data of the verification set is obtained, a result set is compared with the corresponding level data in the data of the verification set, and the accuracy rate of model prediction belongs to { -1,1}.
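Steps S1 to S6 of the sentiment calculation can be sketched as below. This is a simplified illustration under stated assumptions: the input is already segmented into tokens, the three small dictionaries (`EMOTION`, `NEGATIONS`, `CONJUNCTION_WEIGHTS`) and their values are invented for the example, and only the first conjunction in a sentence is considered.

```python
# Hypothetical dictionaries; a real system would load full lexicons.
EMOTION = {"good": 1.0, "satisfied": 1.5, "bad": -1.0, "slow": -0.5}
NEGATIONS = {"not", "never"}
# Conjunction -> (weight of preceding clause, weight of following clause).
CONJUNCTION_WEIGHTS = {"but": (0.3, 0.7)}


def clause_score(tokens):
    """Sum emotion scores, flipping polarity for an odd number of preceding negations."""
    score, pending_negations = 0.0, 0
    for token in tokens:
        if token in NEGATIONS:
            pending_negations += 1
        elif token in EMOTION:
            sign = -1.0 if pending_negations % 2 else 1.0  # double negation cancels
            score += sign * EMOTION[token]
            pending_negations = 0
    return score


def sentence_score(tokens):
    """Weight the clauses on either side of the first conjunction, if any."""
    for i, token in enumerate(tokens):
        if token in CONJUNCTION_WEIGHTS:
            w_front, w_rear = CONJUNCTION_WEIGHTS[token]
            return w_front * clause_score(tokens[:i]) + w_rear * clause_score(tokens[i + 1:])
    return clause_score(tokens)


def polarity(score):
    """Map the accumulated score to a sentiment label by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity(sentence_score(["service", "not", "good"])))  # negative
```

With these toy weights, "staff good but delivery slow" scores 0.3 × 1.0 + 0.7 × (-0.5) = -0.05, so the clause after the conjunction dominates and the sentence is judged negative.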
3. The machine-learning-based text mining method for consumption right-maintaining index analysis of claim 1, wherein in step 4) the specific steps of calculating the index score with the processor are as follows:
S1: setting a consumption right index with m evaluation domains and n experts participating in the evaluation; let the score set given by expert k be {X_{i(j)}}^{(k)}, where {X_{i(j)}}^{(k)} denotes the rank scores of the n experts for domain i (i = 1, 2, …, n) with value j (j = 1, 2, …, m);
S2: according to [formula not reproduced], the rank score set can be converted into a base score set {B_{i(j)}}^{(k)}, where {B_{i(j)}}^{(k)} denotes the base score corresponding to the j-th domain of the k-th expert; the importance degree of each domain is then calculated by formulas (2) and (3) [formulas not reproduced], in which m denotes the number of domains; S_i denotes the score of domain i; n denotes the number of experts; B_{i(j)} denotes the score value of j in domain i; and N_i denotes the number of experts who agree that a given domain ranks in the j-th position;
S3: combining the m scores S_i calculated in S2 into one m-dimensional vector x = [a_1, a_2, …, a_m], which represents the scoring of one complaint work order; the score vectors of all complaint work orders are calculated to form the full data set Z = {z_n, n ∈ R};
S4: labelling the data in the full data set Z, marking every high-score work order as 1 and every other work order as -1; after labelling, randomly extracting 60% of the full data set to form the training data set S_train = {(z_n, y_n) | z_n ∈ Z, y_n ∈ {-1, 1}, n = 1, 2, 3, …}, where z_n is the score vector of the n-th work order and y_n is its corresponding score result; the remaining 40% forms the test data set S_test = {(z_m, y_m) | z_m ∈ Z, y_m ∈ {-1, 1}, m = 1, 2, 3, …}, where z_m is the score vector of the m-th work order and y_m is its corresponding score result.
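The labelling and 60%/40% random split of S4 can be sketched as follows. The source does not say how a "high-score" work order is decided, so the predicate `is_high_score` is a placeholder supplied by the caller; the function name, the fixed seed and the toy vectors are likewise assumptions for illustration.

```python
import random


def label_and_split(score_vectors, is_high_score, train_fraction=0.6, seed=0):
    """Label each score vector as 1 (high score) or -1, then split at random.

    Returns (S_train, S_test) as lists of (z, y) pairs.
    """
    labelled = [(z, 1 if is_high_score(z) else -1) for z in score_vectors]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    cut = round(train_fraction * len(labelled))
    return labelled[:cut], labelled[cut:]


# Ten hypothetical two-dimensional score vectors.
Z = [[i / 10, 1 - i / 10] for i in range(10)]

# Assumed rule for the example only: "high score" means first component >= 0.5.
s_train, s_test = label_and_split(Z, is_high_score=lambda z: z[0] >= 0.5)
print(len(s_train), len(s_test))  # 6 4
```

Shuffling before cutting keeps the two sets disjoint and together equal to Z, which matches the claim's description of the remaining 40% as the test set.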
4. The machine-learning-based text mining method for consumption right-maintaining index analysis of claim 1, wherein in step 5) the specific steps of constructing the SVM classifier are as follows:
S1: let the given training set be {(z_1, y_1), (z_2, y_2), …, (z_n, y_n)}, where z_i ∈ R^n is an input vector and y_i ∈ {-1, 1} is the corresponding output; assuming that the training set can be linearly separated by a hyperplane w · z + b = 0, the problem is converted into the optimal hyperplane problem: [formula not reproduced]
S2: for the linearly non-separable case, a mapping function (called a kernel function in SVM) maps the low-dimensional input space R^n to a high-dimensional feature space H in which the data becomes linearly separable, and the optimization problem is transformed into: [formula (5) not reproduced]
S3: solving the optimization function of equation (5) gives: [formula not reproduced]
from the minimization problem of equation (5), selecting an appropriate kernel function K(·) and parameter C determines the SVM classifier;
S4: using the RBF (radial basis function) kernel, i.e. K(z_i, z_j) = exp(-γ‖z_i - z_j‖²), the optimization problem of the classifier is finally converted into the problem of selecting the parameter pair (C, γ).
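The formulas referred to in this claim are not reproduced in the text. For orientation only, the standard textbook forms of the SVM problem are given below; they are an assumption about what the omitted formulas express and may differ in notation or detail from the patent's own equations. Here w is the normal vector of the hyperplane, ξ_i are slack variables, φ is the feature map into H, and α_i are the dual variables.

```latex
% Hard-margin problem for a linearly separable training set
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.}\quad y_i\,(w\cdot z_i + b)\ \ge\ 1,\qquad i = 1,\dots,n

% Soft-margin problem in the feature space H, with penalty parameter C
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\,\bigl(w\cdot\varphi(z_i) + b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0

% Dual problem, in which only the kernel K(z_i, z_j) = \varphi(z_i)\cdot\varphi(z_j) appears
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,K(z_i, z_j)
\quad \text{s.t.}\quad 0\le\alpha_i\le C,\qquad \sum_{i=1}^{n}\alpha_i\,y_i = 0

% Resulting decision function, and the RBF kernel named in S4
f(z) = \operatorname{sign}\Bigl(\sum_{i=1}^{n}\alpha_i\,y_i\,K(z_i, z) + b\Bigr),
\qquad K(z_i, z_j) = \exp\bigl(-\gamma\,\lVert z_i - z_j\rVert^{2}\bigr)
```

In this form the only free choices are the penalty C and the kernel, and with the RBF kernel the kernel is fixed by γ alone, which is why the claim reduces classifier design to selecting the pair (C, γ).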
5. The machine-learning-based text mining method for consumption right-maintaining index analysis of claim 1, wherein step 6) comprises the following specific steps:
S1: taking 1 ≤ C ≤ 1000 and 0 ≤ γ ≤ 100 as the value ranges, constructing all parameter pairs (C, γ) within these ranges;
S2: taking each parameter pair (C, γ) in turn as the parameters of the SVM classifier based on the RBF kernel function, training on the training data vector set S_train, predicting on the test data vector set S_test, and recording the prediction accuracy on the test data set under this classification model as p_t;
S3: using a computer program to try different (C, γ) parameter pairs in the SVM algorithm until the test-set accuracy p_t reaches the predetermined required accuracy p_0, and recording the parameter pair (C_0, γ_0) that achieves it; this parameter pair gives the model parameters of the SVM classification model sought.
6. The machine-learning-based text mining method for consumption right-maintaining index analysis of claim 1, wherein step 7) comprises the following specific steps:
S1: constructing a work order score ratio data vector x' that has no labelled result;
S2: performing SVM classification prediction on x' with the model parameters (C_0, γ_0) obtained by training in the third step, and obtaining an output result y' ∈ {-1, 1};
S3: y' is the calculation result of the consumption right index model based on the work order emotion degree and the service data.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711366440.9A CN107992613A (en) 2017-12-18 2017-12-18 A kind of Text Mining Technology protection of consumers' rights index analysis method based on machine learning

Publications (1)

Publication Number Publication Date
CN107992613A (en) 2018-05-04

Family

ID=62038511

Country Status (1)

Country Link
CN (1) CN107992613A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180504