CN108021704B

CN108021704B - Agent optimal configuration method based on social public opinion data mining technology

Info

Publication number: CN108021704B
Application number: CN201711445217.3A
Authority: CN
Inventors: 孔祥明; 杨晓霖
Original assignee: Guangdong Guangye Kaiyuan Technology Co ltd
Current assignee: Guangdong Guangye Kaiyuan Technology Co ltd
Priority date: 2017-12-27
Filing date: 2017-12-27
Publication date: 2021-05-04
Anticipated expiration: 2037-12-27
Also published as: CN108021704A

Abstract

The invention discloses an agent optimal configuration method based on social public opinion data mining technology, which comprises the following steps: step1, collecting public opinion data by using web crawler technology; step2, preprocessing the public opinion data, including data cleaning, data integration, data conversion and data reduction; step3, establishing a social public opinion confidence model by using text mining technology and a support vector machine (SVM) algorithm, and dividing social public opinion information into confidence levels; and step4, establishing an agent optimal configuration model. The agent optimal configuration algorithm model provided by the invention divides social public opinion into confidence levels according to real-time Internet public opinion data, using algorithms such as text mining and SVM, so that the incoming call volume of 12345 complaint reports can be estimated and predicted and agents can be configured scientifically and reasonably by using big data technology.

Description

Agent optimal configuration method based on social public opinion data mining technology

Technical Field

The invention relates to the technical field of databases, in particular to an agent optimal configuration method based on social public sentiment data mining technology.

Background

With the steady development of politics, the economy, culture and society in China, people's awareness of safeguarding their rights has gradually strengthened and their attention to social affairs has continued to grow, and the 12345 government service hotline has become an effective window through which people report social problems and express their appeals. For the 12345 government service hotline to exert its influence effectively, the reasonable arrangement of agents is basic work that cannot be ignored; reasonable agent configuration is the basis and the key for the public to express complaints effectively and report problems in time.

Existing agent configuration models set the number of agents only on the basis of historical data such as call volume and average processing time. Such models consider only a narrow set of factors and ignore social public opinion, which is closely related to the number of public complaints, so they easily lead to unreasonable agent configuration. With the high-speed spread of Internet information, the correlation between real-time Internet public opinion data and complaint reporting information keeps strengthening, and mining social public opinion data can provide stronger guidance for the optimal configuration of agents.

Disclosure of Invention

In view of the above defects in the prior art, the technical problem to be solved by the present invention is to provide an agent optimal configuration method based on social public opinion data mining technology, which performs confidence level division on social public opinions according to real-time public opinion data of the internet and algorithms such as text mining and SVM, evaluates and predicts telephone incoming call volume and hot-line average processing duration of 12345 complaint reporting, and provides effective data support for scientific and reasonable configuration of an agent by using big data technology.

In order to achieve the purpose, the invention provides an agent optimal configuration method based on social public sentiment data mining technology, which comprises the following steps:

step1, collecting public opinion data by using a web crawler technology;

step2, public opinion data preprocessing, including data cleaning, data integration, data conversion and data reduction;

step3, establishing a social public opinion confidence model by using a text mining technology and a support vector machine algorithm, and dividing the social public opinion information confidence level;

step4, establishing an agent optimization configuration model.

Further, the step2 specifically includes:

step 21: data cleaning, namely identifying and processing the vacant data, the incomplete data and the unreasonable data;

step 22: data integration, namely organically centralizing and integrating data of different sources, formats and characteristics;

step 23: data conversion, converting the format of the data;

step 24: data reduction, namely simplifying the data on the premise of keeping the integrity of the data.

Further, the step3 specifically includes:

step 31: manually labeling, namely randomly extracting texts in a certain proportion, performing labeling classification by a plurality of related professionals, and counting the consistency of corpus labeling according to the labeling result;

step 32: feature selection, wherein feature selection refers to selecting some representative words from a dictionary to realize dimension reduction; the chi-square (CHI) method is adopted to perform feature selection; assuming that the feature word w and the category a_k satisfy a chi-square distribution with one degree of freedom, the chi-square formula of the feature word w with respect to the category a_k is:

χ²(w, a_k) = N1*(A*D − B*E)² / ((A+E)*(B+D)*(A+B)*(E+D))

where N1 is the total number of documents, A is the number of documents that belong to class a_k and contain the feature word w, B is the number of documents that do not belong to class a_k but contain the feature word w, E is the number of documents that belong to class a_k but do not contain the feature word w, and D is the number of documents that neither belong to class a_k nor contain the feature word w;

for the situation of multiple categories, calculating chi-square statistic of the feature word w under each category;

if the chi-square statistic of the feature word w and the category a_k is 0, the feature word w and the text category a_k are independent of each other; the larger the chi-square statistic, the stronger the correlation between the feature word w and the category a_k; by means of the chi-square formula, features below a specific threshold are removed and features above the threshold are retained, thereby realizing feature selection;

step 33: feature extraction, namely mapping a high-dimensional space to a low-dimensional space to realize dimension reduction; an LSA (latent semantic analysis) algorithm is used to extract features from the text, mainly comprising the following steps:

1) establishing a word frequency matrix M;

2) calculating the singular value decomposition of the word frequency matrix M, decomposing M into three matrices U, S and V, wherein U and V are orthogonal matrices and S is a diagonal matrix;

3) mapping other training samples into a U space;

4) indexing and calculating similarity of the converted documents, and obtaining an LSA classifier through training;

step 34: constructing feature vectors, namely converting each text into an n-dimensional text vector, a plurality of text vectors forming a text vector space; assuming there are n feature words, each text is represented as an n-dimensional vector;

step 35: constructing an SVM classifier; let {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} be a training set, where x_i represents the input vector and y_i ∈ {−1, 1} represents the output label; if the training set can be linearly separated by a hyperplane W·X + b = 0, the problem is transformed into the problem of finding the optimal hyperplane:

min ½‖W‖², subject to y_i(W·x_i + b) ≥ 1, i = 1, …, n    (1)

if the training set is not linearly separable, the low-dimensional input space Rⁿ can be mapped by the kernel function K(x1, x2) to a high-dimensional feature space H to realize linear separability; a polynomial kernel function is selected, and the formula is as follows:

K(x1，x2)＝(<x1，x2>+R)^d

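1) selecting a proper kernel function K(x1, x2) and a penalty coefficient C > 0, the objective function is as follows:

min over a of ½ Σ_i Σ_j a_i a_j y_i y_j K(x_i, x_j) − Σ_i a_i, subject to Σ_i a_i y_i = 0 and 0 ≤ a_i ≤ C    (2)

2) calculating, by using the SMO algorithm, the vector a that minimizes formula (2);

3) calculating w, the formula being w = Σ_i a_i y_i φ(x_i), where φ denotes the implicit mapping into the feature space H;

4) finding all samples (xm, ym) that satisfy 0 < am < C, assuming a total of M support vectors;

5) by means of bm = ym − Σ_i a_i y_i K(x_i, xm), calculating the bm corresponding to each support vector (xm, ym), and obtaining b as the average of the M values bm;

6) thus, the classification hyperplane is Σ_i a_i y_i K(x, x_i) + b = 0

and the classification decision function is f(x) = sign(Σ_i a_i y_i K(x, x_i) + b);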

step 36: predicting results with the model.

Further, the step4 specifically includes:

step 41, according to the social public opinion confidence level classification result in the step3, and in combination with recent historical data of the 12345 service hotline, drawing time-based analysis curves, and using a correlation analysis algorithm to perform correlation analysis on the hotline incoming call volume and the average processing time per day and in different time periods;

step 42, by using a multiple regression analysis algorithm, taking historical data such as the social public opinion confidence level X1, the number of 12345 complaint reports X2, the work order item type X3 and the work order severity X4 as input variables, assigning weights through repeated fitting, and finally constructing a calculation formula for the daily hotline incoming call volume in the following form:

F₁(X)＝W1*f₁(X1)+W2*f₂(X2)+W3*f₃(X3)+W4*f₄(X4)

the hotline incoming call volume in different periods F₂(X) and the hotline average processing time F₃(X) are calculated in a similar manner;

step 43, constructing an agent optimization configuration model by using a multiple regression analysis algorithm; assuming that the daily hotline incoming call volume is F₁(X), the hotline incoming call volume in different periods is F₂(X), the hotline average processing time is F₃(X), the call completion rate is a and the maximum occupancy rate is j, the agent optimization configuration function is: G(X, a, j) = U1*F₁(X) + U2*F₂(X) + U3*F₃(X) + U4*g(a) + U5*h(j), where Ui represents a weight.

Further, the step3 divides the social public opinion information into five confidence levels of optimism, prudent optimism, neutrality, prudent pessimism and pessimism.

The invention has the beneficial effects that:

The agent optimal configuration algorithm model based on social public opinion data mining provided by the invention divides social public opinion into confidence levels according to real-time Internet public opinion data, using algorithms such as text mining and SVM (support vector machine), so that the incoming call volume of 12345 complaint reports can be estimated and predicted and agents can be configured scientifically and reasonably by using big data technology.

The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.

Drawings

FIG. 1 is an overall flow chart of the present invention.

Fig. 2 is a flow chart of establishing a social public opinion confidence model according to the present invention.

FIG. 3 is a flow chart of agent optimization configuration model establishment according to the present invention.

Detailed Description

As shown in fig. 1, the method for optimal configuration of an agent based on social public sentiment data mining technology of the present invention specifically comprises the following operation steps:

Step one: public opinion data collection

Social public opinion data on the Internet is collected by using web crawler technology, for example by regularly crawling social public opinion data from news websites, microblogs, forums, blogs and other social media; the collected data is mainly unstructured data consisting chiefly of text information.
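As a minimal sketch of the collection step, the following Python fetches one page and keeps only its visible text. The patent does not name a crawler library, so the use of the standard library, the class name `TextExtractor` and the functions `extract_text` and `crawl` are illustrative assumptions; scheduling, link following and site-specific parsing are left out.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def crawl(url, timeout=10):
    """Fetch one page and return its text (hypothetical single-page crawler)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_text(response.read().decode(charset, errors="replace"))
```

A periodic job could call `crawl` on a list of source URLs and store the returned text together with the source and the crawl time for the preprocessing step.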

Step two: public opinion data preprocessing

The crawled public opinion data is preprocessed, including steps such as data cleaning, data integration, data conversion and data reduction.

Step 1: data cleaning: missing data, incomplete data and unreasonable data are identified and processed to ensure the integrity, reasonableness, authority and consistency of the data.

Step 2: data integration: data of different sources, formats and characteristics are centralized and integrated.

Step 3: data conversion: the format of the data is converted so that the data can be conveniently analyzed and mined in subsequent steps.

Step 4: data reduction: on the premise of keeping the integrity of the data as far as possible, the data is simplified by common dimensionality reduction methods such as PCA (principal component analysis).
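The four preprocessing steps can be sketched as small functions over a list of records. The record fields (`text`, `time`, `source`), the timestamp format and the function names are hypothetical. Data reduction is shown here as simple field pruning, which is a stand-in chosen for brevity and not the PCA-style reduction mentioned above.

```python
from datetime import datetime


def integrate(*sources):
    """Data integration: merge records from several sources into one list."""
    return [rec for source in sources for rec in source]


def clean(records):
    """Data cleaning: drop records with missing or empty text, and duplicates."""
    seen, kept = set(), []
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text or text in seen:
            continue
        seen.add(text)
        kept.append({**rec, "text": text})
    return kept


def convert(records, time_format="%Y/%m/%d %H:%M"):
    """Data conversion: normalize the timestamp string to ISO 8601."""
    out = []
    for rec in records:
        stamp = datetime.strptime(rec["time"], time_format)
        out.append({**rec, "time": stamp.isoformat()})
    return out


def reduce_fields(records, keep=("text", "time")):
    """Data reduction (assumed form): keep only the fields needed downstream."""
    return [{key: rec[key] for key in keep} for rec in records]
```

The functions are meant to be chained, for example `reduce_fields(convert(clean(integrate(news, forum))))`.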

Step three: the social public opinion confidence model is shown in fig. 2:

A social public opinion confidence model is established by using text mining technology and a support vector machine algorithm, and public opinion information is divided into five confidence levels: optimism, prudent optimism, neutrality, prudent pessimism and pessimism.

Step 1: manual labeling: a certain proportion of texts is randomly extracted and labeled by several relevant professionals; the consistency of the corpus labeling is computed from the labeling results, and labels that pass the consistency check are used for information classification.
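The text does not say how labeling consistency is measured. One simple possibility, shown here purely as an assumption, is the mean pairwise agreement rate between annotators; the function name `pairwise_agreement` is hypothetical.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Mean fraction of texts on which each pair of annotators agrees.

    annotations: one list of labels per annotator, all of equal length.
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    rates = []
    for first, second in pairs:
        matches = sum(1 for a, b in zip(first, second) if a == b)
        rates.append(matches / len(first))
    return sum(rates) / len(rates)
```

A chance-corrected measure such as a kappa statistic would be a stricter alternative when the five levels are unevenly distributed.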

Step 2: feature selection: feature selection refers to selecting some representative words from a dictionary to realize dimension reduction.

Features are selected by using the chi-square (CHI) method. Assuming that the feature word w and the category a_k satisfy a chi-square distribution with one degree of freedom, the chi-square formula of the feature word w with respect to the category a_k is:

χ²(w, a_k) = N1*(A*D − B*E)² / ((A+E)*(B+D)*(A+B)*(E+D))

where N1 is the total number of documents, A is the number of documents that belong to class a_k and contain the feature word w, B is the number of documents that do not belong to class a_k but contain the feature word w, E is the number of documents that belong to class a_k but do not contain the feature word w, and D is the number of documents that neither belong to class a_k nor contain the feature word w.

For the case of multiple categories, it is necessary to calculate chi-square statistics of the feature word w under each category.

If the chi-square statistic of the feature word w and the category a_k is 0, the feature word w and the text category a_k are independent of each other; the larger the chi-square statistic, the stronger the correlation between the feature word w and the category a_k. By means of the chi-square formula, features below a specific threshold can be removed and features above the threshold retained, thereby realizing feature selection.
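The selection rule can be illustrated with a short Python sketch. It assumes the standard 2×2 chi-square statistic N1*(A*D − B*E)² / ((A+E)*(B+D)*(A+B)*(E+D)) built from the counts A, B, E and D defined above; the function names and the example threshold are illustrative.

```python
def chi_square(a, b, e, d):
    """Chi-square statistic of a feature word w for a category a_k.

    a: in class, contains w      b: not in class, contains w
    e: in class, lacks w         d: not in class, lacks w
    """
    n1 = a + b + e + d
    denominator = (a + e) * (b + d) * (a + b) * (e + d)
    if denominator == 0:
        return 0.0
    return n1 * (a * d - b * e) ** 2 / denominator


def select_features(counts, threshold):
    """Keep the words whose chi-square statistic exceeds the threshold."""
    return [word for word, cells in counts.items()
            if chi_square(*cells) > threshold]
```

For example, a word that appears in 15 of 20 in-class documents and 5 of 20 out-of-class documents scores 10.0, while a word spread evenly over both groups scores 0 and is dropped.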

Step 3: feature extraction: dimension reduction is achieved by mapping the high-dimensional space to a low-dimensional space.

An LSA algorithm is applied to extract features from the text, mainly comprising the following steps:

1) establishing a word frequency matrix M;

2) calculating the singular value decomposition of the word frequency matrix M, decomposing M into three matrices U, S and V, wherein U and V are orthogonal matrices and S is a diagonal matrix;

3) mapping other training samples into the U space;

4) indexing the converted documents and calculating their similarity, and obtaining the LSA classifier through training.
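The similarity measure used on the converted documents is not named. Cosine similarity is a common choice for vectors in a reduced semantic space and is assumed in this sketch; the decomposition itself is not shown.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = sqrt(sum(p * p for p in u))
    norm_v = sqrt(sum(q * q for q in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)
```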

Step 4: constructing a feature vector: each text is converted into an n-dimensional text vector, and a plurality of text vectors form a text vector space. Assuming there are n feature words, each text is represented as an n-dimensional vector.
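A minimal sketch of this conversion, assuming term frequency as the vector component and assuming the text has already been split into tokens (Chinese text would first need word segmentation, which is not shown):

```python
from collections import Counter


def text_to_vector(tokens, feature_words):
    """Map a tokenized text to an n-dimensional term-frequency vector."""
    counts = Counter(tokens)
    return [counts[word] for word in feature_words]
```

Stacking the vectors of all texts row by row gives a document-by-word frequency matrix of the kind used in Step 3.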

Step 5: constructing an SVM classifier: let {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} be a training set, where x_i represents the input vector and y_i ∈ {−1, 1} represents the output label. If the training set can be linearly separated by a hyperplane W·X + b = 0, the problem is transformed into the problem of finding the optimal hyperplane:

min ½‖W‖², subject to y_i(W·x_i + b) ≥ 1, i = 1, …, n    (1)

If the training set is not linearly separable, the low-dimensional input space Rⁿ can be mapped by the kernel function K(x1, x2) to a high-dimensional feature space H to realize linear separability.

The kernel function refers to an inner product function of two vectors in a space after implicit mapping, common kernel functions include a polynomial kernel function, a linear kernel function, a gaussian kernel function and the like, and the polynomial kernel function is selected herein, and the formula is as follows:

K(x1，x2)＝(<x1，x2>+R)^d

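1) selecting a proper kernel function K(x1, x2) and a penalty coefficient C > 0, the objective function is as follows:

min over a of ½ Σ_i Σ_j a_i a_j y_i y_j K(x_i, x_j) − Σ_i a_i, subject to Σ_i a_i y_i = 0 and 0 ≤ a_i ≤ C    (2)

2) calculating, by using the SMO algorithm, the vector a that minimizes formula (2);

3) calculating w, the formula being w = Σ_i a_i y_i φ(x_i), where φ denotes the implicit mapping into the feature space H;

4) finding all samples (xm, ym) that satisfy 0 < am < C, assuming a total of M support vectors;

5) by means of bm = ym − Σ_i a_i y_i K(x_i, xm), calculating the bm corresponding to each support vector (xm, ym), and obtaining b as the average of the M values bm;

6) thus, the classification hyperplane is Σ_i a_i y_i K(x, x_i) + b = 0

The classification decision function is f(x) = sign(Σ_i a_i y_i K(x, x_i) + b)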
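To make the kernel and the decision step concrete, the sketch below evaluates the polynomial kernel K(x1, x2) = (<x1, x2> + R)^d and a kernel decision function of the usual form sign(Σ_i a_i y_i K(x, x_i) + b). The multipliers a_i and the offset b are taken as given; in practice they would come from the SMO training step, which is not implemented here. The default values of `r` and `d` are arbitrary assumptions.

```python
def polynomial_kernel(x1, x2, r=1.0, d=2):
    """K(x1, x2) = (<x1, x2> + R)^d."""
    dot = sum(p * q for p, q in zip(x1, x2))
    return (dot + r) ** d


def decide(x, support_vectors, labels, alphas, b, kernel=polynomial_kernel):
    """Return +1 or -1 from the sign of sum_i a_i * y_i * K(x, x_i) + b."""
    score = sum(a * y * kernel(x, sv)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1
```

This classifier is binary. The text does not state how the five confidence levels are obtained from it; combining several binary classifiers, for instance one per level against the rest, is one common option.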

Step 6: model prediction results

Confidence levels are assigned to unlabeled social public opinion data by using the SVM parameters and model trained in Step 5.

Step four: the agent optimization configuration model is as shown in FIG. 3:

Step 1: according to the social public opinion confidence level classification result in Step three, and in combination with recent historical data of the 12345 service hotline, time-based analysis curves are drawn, and a correlation analysis algorithm is used to perform correlation analysis on the hotline incoming call volume and the average processing time per day and in different time periods.

Step 2: by using a multiple regression analysis algorithm, historical data such as the social public opinion confidence level X1, the number of 12345 complaint reports X2, the work order item type X3 and the work order severity X4 are taken as input variables, weights are assigned through repeated fitting, and a calculation formula for the daily hotline incoming call volume is finally constructed in the following form:
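The correlation analysis algorithm is not specified. As one plausible reading, the sketch below computes the Pearson correlation coefficient between a daily confidence-level series (coded numerically) and the daily hotline call volume; both the choice of Pearson correlation and the numeric coding are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

The same function can be applied per time period, or to the average processing time instead of the call volume.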

F₁(X)＝W1*f₁(X1)+W2*f₂(X2)+W3*f₃(X3)+W4*f₄(X4)

The hotline incoming call volume in different periods F₂(X) and the hotline average processing time F₃(X) are calculated in a similar manner.
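A minimal sketch of the weight fitting, under the simplifying assumption that each f_i is the identity function so that F₁(X) = W1*X1 + W2*X2 + W3*X3 + W4*X4, and that the weights are found by ordinary least squares on historical samples. The helper names are hypothetical and the categorical inputs are assumed to be numerically coded beforehand.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: inputs are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    weights = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * weights[k] for k in range(row + 1, n))
        weights[row] = (aug[row][n] - tail) / aug[row][row]
    return weights


def fit_weights(samples, targets):
    """Least-squares weights W via the normal equations (X^T X) W = X^T y."""
    n = len(samples[0])
    xtx = [[sum(s[i] * s[j] for s in samples) for j in range(n)]
           for i in range(n)]
    xty = [sum(s[i] * t for s, t in zip(samples, targets)) for i in range(n)]
    return solve(xtx, xty)


def predict(weights, x):
    """Evaluate F1(X) for one input vector X = (X1, X2, X3, X4)."""
    return sum(w * v for w, v in zip(weights, x))
```

Fitting F₂(X) and F₃(X) would reuse `fit_weights` with call volume per period or average processing time as the target.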

Step 3: an agent optimization configuration model is constructed by using a multiple regression analysis algorithm. Assuming that the daily hotline incoming call volume is F₁(X), the hotline incoming call volume in different periods is F₂(X), the hotline average processing time is F₃(X), the call completion rate is a and the maximum occupancy rate is j, the agent optimization configuration function is: G(X, a, j) = U1*F₁(X) + U2*F₂(X) + U3*F₃(X) + U4*g(a) + U5*h(j), where Ui represents a weight.
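The configuration function is a weighted sum and can be evaluated directly once the predictions and weights are available. In this sketch the transforms g and h, which the text leaves undefined, default to the identity as a placeholder assumption, and all numeric values in the usage note are invented for illustration.

```python
def agent_configuration(f1, f2, f3, a, j, weights,
                        g=lambda v: v, h=lambda v: v):
    """G(X, a, j) = U1*F1(X) + U2*F2(X) + U3*F3(X) + U4*g(a) + U5*h(j)."""
    u1, u2, u3, u4, u5 = weights
    return u1 * f1 + u2 * f2 + u3 * f3 + u4 * g(a) + u5 * h(j)
```

For instance, with a predicted 1200 calls per day, 150 calls in the peak period, a 3.5 minute average processing time, a 0.95 call completion rate, a 0.8 maximum occupancy rate and hypothetical weights (0.01, 0.05, 2.0, 10.0, -5.0), the function returns approximately 32.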

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A method for optimally configuring an agent based on a social public sentiment data mining technology is characterized by comprising the following steps:

step1, collecting public opinion data by using a web crawler technology;

step2, public opinion data preprocessing, including data cleaning, data integration, data conversion and data reduction;

step3, establishing a social public opinion confidence model by using a text mining technology and a support vector machine algorithm, and dividing the social public opinion information confidence level; the step3 divides the social public opinion information into five confidence levels of optimism, prudent optimism, neutrality, prudent pessimism and pessimism;

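step4, establishing an agent optimization configuration model, wherein the step4 specifically comprises the following steps:

step 41, according to the social public opinion confidence level classification result in the step3, and in combination with recent historical data of the 12345 service hotline, drawing time-based analysis curves, and using a correlation analysis algorithm to perform correlation analysis on the hotline incoming call volume and the average processing time per day and in different time periods;

step 42, by using a multiple regression analysis algorithm, taking historical data such as the social public opinion confidence level X1, the number of 12345 complaint reports X2, the work order item type X3 and the work order severity X4 as input variables, assigning weights through repeated fitting, and finally constructing a calculation formula for the daily hotline incoming call volume in the following form: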

F₁(X)＝W1*f₁(X1)+W2*f₂(X2)+W3*f₃(X3)+W4*f₄(X4)

step 43, constructing an agent optimization configuration model by using a multiple regression analysis algorithm; assuming that the daily hotline incoming call volume is F₁(X), the hotline incoming call volume in different periods is F₂(X), the hotline average processing time is F₃(X), the call completion rate is a and the maximum occupancy rate is j, the agent optimization configuration function is: G(X, a, j) = U1*F₁(X) + U2*F₂(X) + U3*F₃(X) + U4*g(a) + U5*h(j), where Ui represents a weight.

2. The method for optimal configuration of the agent based on the social public opinion data mining technology as claimed in claim 1, wherein the step2 specifically comprises:

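step 21: data cleaning, namely identifying and processing the vacant data, the incomplete data and the unreasonable data;

step 22: data integration, namely organically centralizing and integrating data of different sources, formats and characteristics;

step 23: data conversion, converting the format of the data;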