CN109408574B - Complaint responsibility confirmation system based on text mining technology - Google Patents


Info

Publication number
CN109408574B
Authority
CN
China
Prior art keywords
data
service
layer
model
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811211967.9A
Other languages
Chinese (zh)
Other versions
CN109408574A (en
Inventor
李静
张雪梅
孟巍
吴雪霞
梁雅洁
徐美玲
王若晗
刘思涛
孙峰
孙云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Marketing Service Center of State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201811211967.9A priority Critical patent/CN109408574B/en
Publication of CN109408574A publication Critical patent/CN109408574A/en
Application granted granted Critical
Publication of CN109408574B publication Critical patent/CN109408574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/01 Customer relationship services
    • G06Q30/015 Providing customer assistance, e.g. assisting a customer within a business location or via helpdesk
    • G06Q30/016 After-sales
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Multimedia (AREA)
  • Primary Health Care (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a complaint responsibility confirmation system based on text mining technology, comprising a data layer, a platform layer, middleware, a service layer, and a business layer. The data layer comprises a relational database, a distributed file system, and a non-relational database; it periodically extracts data from the 95598 business support system and the marketing business application system of the State Grid headquarters and stores the data by category. The platform layer comprises a public service component, a business flow engine component, and Spark. The middleware comprises Zookeeper, an image recognition algorithm plug-in, and a decision tree algorithm plug-in; the two plug-ins perform computation on the data of the data layer and build models through data modeling, data mining, and similar methods to analyze and mine the data. The service layer comprises a unified coding service, a unified caching service, a bidding model calculation service, a flow service, and a file storage service. The business layer comprises responsibility identification, image recognition, and automatic work order generation.

Description

Complaint responsibility confirmation system based on text mining technology
Technical Field
The invention relates to a complaint responsibility confirmation system based on a text mining technology.
Background
The company has built a large-scale, centralized nationwide customer service center at the headquarters level, unified service standards across the entire network, rebuilt its customer service system, comprehensively optimized service resources, and established an efficient cooperation mechanism that is connected vertically and integrated horizontally, so as to close the "last kilometer" of customer service. To make customer service monitoring effective, pass working pressure down level by level, raise the attention each level pays to service, and promote rapid improvement of the company's service level, customer complaint work orders need to be managed in a closed loop and the parties responsible for complaints need to be determined.
At present, 95598 complaint responsibility identification is mainly carried out offline: information and data are passed by telephone and OA, and staff record the results manually in spreadsheets. Information omission, data loss, and overdue work occur easily, historical records are difficult to review, and report statistics are difficult to produce, all of which hinders efficient and standardized management of responsibility identification.
Disclosure of Invention
The invention aims to provide a complaint responsibility confirmation system based on text mining technology, which can automatically generate a responsibility confirmation work order.
To solve the technical problem, the invention comprises a data layer, a platform layer, middleware, a service layer, and a business layer;
the data layer comprises a relational database, a distributed file system, and a non-relational database; the data layer periodically extracts data from the 95598 business support system and the marketing business application system of the State Grid headquarters and stores the data, by category, in the relational database, the distributed file system, and the non-relational database; the data are computed on using the image recognition technology and the decision tree processing technology and modeled through data modeling and data mining, so that the data are analyzed and mined and the responsibility determination function module of the support system can be applied;
the platform layer comprises the public service component and the business flow engine component of the marketing development platform, and Spark;
the middleware comprises Zookeeper, an image recognition algorithm plug-in, and a decision tree algorithm plug-in;
the service layer comprises a unified coding service, a unified caching service, a bidding model calculation service, a flow service, and a file storage service;
the business layer comprises responsibility identification, image recognition, and automatic work order generation;
the image recognition technology analyzes the data submitted online by users, compares the submitted video and image data, and associates them with the related video data and the responsibility confirmation work order;
data integration: real-time distributed data messaging and Hadoop offline data extraction are combined to achieve rapid access to unstructured data and to build a distributed data integration function;
data storage: relational data storage, unstructured data storage, and distributed file storage are provided behind a unified storage access interface;
data calculation: the platform layer parses and stores unstructured files using batch computing and stream computing, classifies and translates the data using Spark distributed computing in combination with the image recognition model and the decision tree analysis model, and places the translated data into a multi-threaded queue for parallel distributed computing tasks;
data analysis: information is extracted from the data from multiple angles using the S-PLUS language; an ETL tool extracts data from distributed, heterogeneous data sources such as relational data and flat data files into a temporary intermediate layer, where the data are cleaned, transformed, and integrated, and finally loaded into a data warehouse or data mart to serve as the basis for online analytical processing and data mining;
finally, a model for automatically generating responsibility confirmation work orders is constructed using the decision tree algorithm.
The public service component comprises unified coding, unified caching, unified verification rules, a plug-in communication mode, and REST-based RPC.
The image recognition technology adopts a BRISK image recognition algorithm, and uses detection random response to establish a descriptor to generate a binary descriptor.
The BRISK image recognition algorithm is realized by the following steps:
(1) scale-space feature point detection: feature points of interest are identified across both the image and scale dimensions using a saliency criterion; to improve computational efficiency, feature points are detected in the layers of the image pyramid and in the layers in between; the position and scale of each key point are obtained in the continuous domain by fitting a quadratic function;
(2) key point detection: a sampling pattern of points lying on appropriately scaled concentric circles is applied to the neighborhood of each key point to retrieve gray values; the direction of the feature description is determined by processing the local intensity gradients;
the BRISK binary descriptor is a binary string formed by concatenating the results of simple brightness comparison tests, and BRISK determines the characteristic direction of each feature point so as to obtain orientation-normalized descriptors;
① the key idea of the BRISK descriptor is a pattern used to sample pixels in the neighborhood of the key point: taking the key point as the center, N sampling points are placed, equally spaced, on circles concentric with the key point;
② Gaussian smoothing is applied to each sampling point Pi in the pattern, with the standard deviation σi proportional to the distance between the sampling points on the respective circle; the pattern is positioned and scaled for a key point k in the image, and one of the N(N-1)/2 sampling point pairs, denoted (Pi, Pj), is considered;
③ the smoothed pixel values of these points are I(Pi, σi) and I(Pj, σj), respectively, and the local gradient g(Pi, Pj) is estimated by:
Figure GDA0003067546410000031
④ the set of all such combinations is called the set of sampling point pairs and is expressed as:
Figure GDA0003067546410000032
⑤ the subset S of short-distance point pairs and the subset L of long-distance point pairs are defined as follows:
Figure GDA0003067546410000033
Figure GDA0003067546410000034
⑥ the main direction of the feature point k is calculated as follows:
Figure GDA0003067546410000035
α=arctan2(gy,gx)
(3) the oriented BRISK sampling pattern is used to obtain pairwise brightness comparison results, which are assembled into the binary BRISK descriptor;
① the sampling pattern is rotated around the key point k by the angle α = arctan2(gy, gx);
② the BRISK descriptor is a 512-bit vector; each bit is generated by a brightness comparison on one short-distance point pair (Pαi, Pαj) ∈ S, where the superscript α denotes the rotated pattern, and each bit b corresponds to:
Figure GDA0003067546410000036
Figure GDA0003067546410000037
③ matching two BRISK descriptors is a simple computation of their Hamming distance, as in BRIEF: the number of bits that differ between the two descriptors measures their dissimilarity, and the operation reduces to a bitwise XOR followed by a bit count;
④ the comparison result is extracted, and after big data computation and comparison the image result is matched and associated with the responsibility work order in the system for business personnel to review.
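The equation images referenced in steps (2) and (3) are not reproduced in this text. As a reading aid only, the expressions of the published BRISK formulation (Leutenegger et al.) that match the symbols used above are sketched here. The set name \mathcal{A}, the thresholds \delta_{\max} and \delta_{\min}, and the use of L for the number of long-distance pairs follow that publication and are assumptions with respect to the patent's own figures.

```latex
% Local gradient estimated from two smoothed sampling points
\[ g(p_i, p_j) = (p_j - p_i)\,\frac{I(p_j, \sigma_j) - I(p_i, \sigma_i)}{\lVert p_j - p_i \rVert^{2}} \]

% Set of all sampling point pairs
\[ \mathcal{A} = \{ (p_i, p_j) \in \mathbb{R}^2 \times \mathbb{R}^2 \mid i < N \wedge j < i \wedge i, j \in \mathbb{N} \} \]

% Short-distance and long-distance subsets
\[ \mathcal{S} = \{ (p_i, p_j) \in \mathcal{A} \mid \lVert p_j - p_i \rVert < \delta_{\max} \} \subseteq \mathcal{A} \]
\[ \mathcal{L} = \{ (p_i, p_j) \in \mathcal{A} \mid \lVert p_j - p_i \rVert > \delta_{\min} \} \subseteq \mathcal{A} \]

% Main direction of key point k (L = number of long-distance pairs)
\[ g = \begin{pmatrix} g_x \\ g_y \end{pmatrix} = \frac{1}{L} \sum_{(p_i, p_j) \in \mathcal{L}} g(p_i, p_j), \qquad \alpha = \operatorname{arctan2}(g_y, g_x) \]

% Descriptor bit for each rotated short-distance pair
\[ b = \begin{cases} 1, & I(p_j^{\alpha}, \sigma_j) > I(p_i^{\alpha}, \sigma_i) \\ 0, & \text{otherwise} \end{cases} \qquad \forall\, (p_i^{\alpha}, p_j^{\alpha}) \in \mathcal{S} \]
```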
The system processes text using a natural-language-based text processing method and builds the keyword extraction model with a conditional random field model; the specific steps are:
(1) in the first modeling step, the model's basic data maintenance function loads pre-trained word embeddings and, at the same time, extracts the meanings of some keywords from their characters;
(2) an LSTM is used to obtain a meaningful representation of each word in its context;
(3) the keyword extraction model is formulated as a binary classification problem: deciding whether a word or phrase in a document is a keyword; because this is a classification problem, a labeled training corpus must be provided, the keyword extraction model is trained on that corpus, and the trained model is then used to extract keywords from the documents to be processed.
The keyword extraction model is a linear-chain CRF. In a linear-chain CRF, features depend only on the current label and the previous label, not on arbitrary labels elsewhere in the sentence. To model this behavior, feature functions are used, each taking several inputs:
the sentence s
the position i of the word in the sentence
the label l_i of the current word
the label l_{i-1} of the previous word
Next, each feature function f_j is assigned a weight λ_j. Given a sentence s, a labeling l of s can now be scored by summing the weighted features over all words in the sentence:
Figure GDA0003067546410000041
Finally, these scores can be converted into probabilities p(l | s) between 0 and 1 by exponentiating and normalizing:
Figure GDA0003067546410000042
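The two equation images above are not reproduced in this text. Under the definitions just given, the usual linear-chain CRF expressions would read as follows; this is a sketch, and the index bounds m (number of feature functions) and n (number of words in s) are names assumed here rather than taken from the source.

```latex
% Score of a labeling l for sentence s: weighted features summed over all words
\[ \mathrm{score}(l \mid s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j \, f_j(s, i, l_i, l_{i-1}) \]

% Exponentiate and normalize over all candidate labelings l'
\[ p(l \mid s) = \frac{\exp\bigl[\mathrm{score}(l \mid s)\bigr]}{\sum_{l'} \exp\bigl[\mathrm{score}(l' \mid s)\bigr]} \]
```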
The decision tree processing technology builds its model as a decision tree, and the decision tree model is computed using the TF-IDF method. Assume that a node S contains s samples in m classes Ci, i ∈ {1, 2, …, m}, with si samples in class Ci; the amount of information (entropy) required to determine the class of any sample in the node is:
Figure GDA0003067546410000051
where pi is the probability that any sample belongs to class Ci, so pi = si/s. Suppose feature A has v distinct values {a1, a2, ..., av}; then feature A divides the node's samples into v subsets {S1, S2, ..., Sv}, where Sj contains the samples in S whose feature A takes the value aj, and the number of samples in Sj is sj. Let Sij be the set of samples in subset Sj that belong to class Ci, with sij samples; then the entropy required to determine the class of any sample in subset Sj is:
Figure GDA0003067546410000052
The information entropy required to determine the class of the samples across all subsets is the weighted average of the information entropy required for each subset individually:
Figure GDA0003067546410000053
In the above formula,
Figure GDA0003067546410000054
is the weight of the j-th subset.
The information gain obtained by using feature A to divide the samples of the current node into subsets is:
Gain(A)=I(s1,s2,...,sm)-E(A)
The feature with the maximum Gain(A), that is, the minimum E(A), is selected. Dividing the node's samples into subsets with the test feature chosen this way minimizes the weighted average information entropy required to determine the class of any sample in the resulting subsets. Information entropy reflects the uncertainty of the samples: the larger the entropy, the higher the uncertainty; the smaller the entropy, the lower the uncertainty. A test feature selected in this way therefore makes it easier to determine the sample classes in the subsets afterwards, which keeps the overall decision tree simpler.
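The feature selection rule above can be illustrated with a minimal Python sketch. The base-2 logarithm, the function names, and the sample records (a `category` and a `channel` feature with a responsible-department label) are assumptions made for illustration; the equation images in the source are not reproduced, so the code follows the standard information-gain definitions that match the symbols in the text.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """I(s1, ..., sm) = -sum(p_i * log2(p_i)), with p_i = s_i / s."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(samples, feature):
    """Gain(A) = I(s1, ..., sm) - E(A) for one candidate feature A.

    `samples` is a list of (feature_dict, class_label) pairs.
    """
    labels = [label for _, label in samples]
    subsets = defaultdict(list)  # value a_j -> class labels of subset S_j
    for features, label in samples:
        subsets[features[feature]].append(label)
    # E(A): entropy of each subset weighted by its share s_j / s of the samples.
    expected = sum(len(sub) / len(samples) * entropy(sub) for sub in subsets.values())
    return entropy(labels) - expected


def best_feature(samples, features):
    """Pick the feature with maximum Gain(A), i.e. minimum E(A)."""
    return max(features, key=lambda f: information_gain(samples, f))


# Hypothetical complaint records: features -> responsible department.
SAMPLES = [
    ({"category": "outage", "channel": "phone"}, "operations"),
    ({"category": "outage", "channel": "web"}, "operations"),
    ({"category": "billing", "channel": "phone"}, "marketing"),
    ({"category": "billing", "channel": "web"}, "marketing"),
]

print(best_feature(SAMPLES, ["category", "channel"]))
```

In this toy data the `category` feature separates the two departments completely while `channel` carries no information, so `category` is the feature a tree would split on first.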
With the above structure, the system uses the user information, organizational structure information, complaint information, and other information in the power company's marketing business application system, combines it with the video information collected by the 95598 video monitoring system and the file information collected by the structured data platform, and applies image recognition and file retrieval technology. By constructing an image recognition model and a full-text retrieval model, it analyzes and compiles typical video and image cases; supports scenarios such as supervision and management of user complaint events, handling of complaint responsibility, and comprehensive analysis; and provides decision support for improving the company's power supply service quality and service image. Using the image recognition and decision tree models, the system intelligently analyzes customer complaint data and automatically matches the associated system data. Complaint responsibility confirmation is carried out online, which shortens the confirmation cycle and allows work orders to be processed in time. This addresses the heavy workload of confirming complaint work orders, the long communication cycle among confirmation and statistics personnel, the complexity of report analysis of confirmation results, and the non-standard confirmation process; it effectively improves the efficiency of complaint responsibility confirmation, removes the limitations of traditional offline circulation, and saves the company's human resources.
The system can automatically identify the complaint data that require responsibility confirmation, accurately locate the smallest department responsible for a complaint, and analyze and mine the confirmation result data, which helps standardize customer complaint investigation and handling and improves power supply service capability and service quality.
Drawings
The invention is described in further detail below with reference to the following figures and embodiments:
fig. 1 is an architecture diagram of the present system.
Detailed Description
The complaint responsibility confirmation system based on text mining technology is developed on a J2EE enterprise application framework platform. The overall technical architecture follows a service-oriented architecture (SOA) and a B/S (browser/server) structure, and provides service support in layers, so that all application components are effectively integrated and services are delivered to users via the Web. As shown in Fig. 1, the system comprises a data layer, a platform layer, middleware, a service layer, and a business layer. The data layer comprises a relational database, a distributed file system, and a non-relational database. The system's data sources mainly comprise the 95598 business support system and the user file data, organizational structure data, and other external data in the State Grid marketing business application system. The data layer periodically extracts data from these sources and stores the data, by category, in the relational database, the distributed file system, and the non-relational database; the data are computed on using the image recognition technology and the decision tree algorithm and modeled through data modeling, data mining, and similar methods, so that the data are analyzed and mined. The platform layer comprises the public service component and the business flow engine component of the marketing development platform, and Spark (a big data cloud computing platform). The public service component comprises unified coding, unified caching, unified verification rules, a plug-in communication mode, and REST-based RPC. The middleware comprises Zookeeper, an image recognition algorithm plug-in, and a decision tree algorithm plug-in. The service layer comprises a unified coding service, a unified caching service, a bidding model calculation service, a flow service, and a file storage service. The business layer comprises responsibility identification, image recognition, and automatic work order generation.
The system processes text using a natural-language-based text processing method. A keyword extraction model extracts key information from the large blocks of text that users enter into the system and presents the key information to business personnel for review, which simplifies their work. The goal of the natural-language-based text processing is to label each word in the text with one of four categories (power service, person, miscellaneous, and other) and then to find the most prominent power service and name in the text. The keyword extraction classifier first labels each word; the keyword extraction model then filters out unwanted labels using a rule-based approach and determines the most prominent power service and name. The method builds the keyword extraction model with a conditional random field model; the specific steps are:
(1) in the first modeling step, the model's basic data maintenance function loads pre-trained word embeddings and, at the same time, extracts the meanings of some keywords from their characters;
(2) an LSTM is used to obtain a meaningful representation of each word in its context;
(3) the keyword extraction model is formulated as a binary classification problem: deciding whether a word or phrase in a document is a keyword. Because this is a classification problem, a labeled training corpus must be provided; the keyword extraction model is trained on that corpus, and the trained model is then used to extract keywords from the documents to be processed.
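The rule-based filtering step described before the numbered steps can be sketched as follows. This is only an illustration under stated assumptions: the label strings are hypothetical names for the four categories, the input is assumed to be already labeled by the classifier, and "most prominent" is interpreted here as "most frequent", which the source does not specify.

```python
from collections import Counter

# Hypothetical label names for the four categories named in the text.
POWER_SERVICE, PERSON, MISC, OTHER = "POWER_SERVICE", "PERSON", "MISC", "OTHER"


def most_prominent(tagged_words):
    """Drop unwanted labels, then return the most frequent power service and name.

    `tagged_words` is a list of (word, label) pairs produced by the classifier.
    """
    wanted = (POWER_SERVICE, PERSON)
    kept = [(word, label) for word, label in tagged_words if label in wanted]
    result = {}
    for label in wanted:
        counts = Counter(word for word, tag in kept if tag == label)
        # None signals that no word with this label survived the filter.
        result[label] = counts.most_common(1)[0][0] if counts else None
    return result


# Hypothetical classifier output for one complaint text.
TAGGED = [
    ("meter", POWER_SERVICE), ("reading", OTHER), ("Zhang", PERSON),
    ("meter", POWER_SERVICE), ("outage", POWER_SERVICE), ("yesterday", MISC),
]

print(most_prominent(TAGGED))
```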
In the keyword extraction model, the input data are sequence data, and when predicting the label of a given data point, the context of the preceding text must be taken into account. The keyword extraction model in this system is a linear-chain CRF. In a linear-chain CRF, features depend only on the current label and the previous label, not on arbitrary labels elsewhere in the sentence. To model this behavior, feature functions are used, each taking several inputs:
the sentence s
the position i of the word in the sentence
the label l_i of the current word
the label l_{i-1} of the previous word
Next, each feature function f_j is assigned a weight λ_j. Given a sentence s, a labeling l of s can now be scored by summing the weighted features over all words in the sentence:
Figure GDA0003067546410000071
Examples of feature functions based on part-of-speech tagging:
If l_i = ADVERB and the i-th word ends in "-ly", then f_1(s, i, l_i, l_{i-1}) = 1; otherwise it is 0. If the corresponding weight λ_1 is large and positive, this feature says that words ending in "-ly" tend to be labeled ADVERB.
If i = 1, l_i = VERB, and the sentence ends with a question mark, then f_2(s, i, l_i, l_{i-1}) = 1; otherwise it is 0. If the corresponding weight λ_2 is large and positive, this feature says that the first word of a question tends to be labeled VERB (e.g., "Is this a sentence beginning with a verb?").
If l_{i-1} = ADJECTIVE and l_i = NOUN, then f_3(s, i, l_i, l_{i-1}) = 1; otherwise it is 0. A positive weight λ_3 means that nouns tend to follow adjectives.
If l_{i-1} = PREPOSITION and l_i = PREPOSITION, then f_4(s, i, l_i, l_{i-1}) = 1; otherwise it is 0. The weight λ_4 for this function is negative, indicating that a preposition should not follow another preposition, so such a labeling should be avoided.
Finally, these scores can be converted into probabilities p(l | s) between 0 and 1 by exponentiating and normalizing:
Figure GDA0003067546410000081
In summary, to build the keyword extraction model, a set of feature functions is defined (each may depend on the whole sentence, the current position of a word, and the labels of nearby words), each function is assigned a weight, the weighted features are summed, and, if necessary, the result is converted into a probability.
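The four example feature functions and the scoring and normalization steps can be put together in a short Python sketch. The weight values are hypothetical (in a trained CRF they are learned from the labeled corpus), positions are 0-based in the code where the text uses i = 1 for the first word, and the normalizer is computed by brute-force enumeration of all labelings, which is only practical for very short sentences; real implementations typically use dynamic programming instead.

```python
from itertools import product
from math import exp

LABELS = ("ADVERB", "VERB", "ADJECTIVE", "NOUN", "PREPOSITION")


def f1(s, i, cur, prev):
    # Words ending in "-ly" tend to be adverbs.
    return 1 if cur == "ADVERB" and s[i].endswith("ly") else 0


def f2(s, i, cur, prev):
    # The first word of a question tends to be a verb (i is 0-based here).
    return 1 if i == 0 and cur == "VERB" and s[-1].endswith("?") else 0


def f3(s, i, cur, prev):
    # Nouns tend to follow adjectives.
    return 1 if prev == "ADJECTIVE" and cur == "NOUN" else 0


def f4(s, i, cur, prev):
    # A preposition should not follow another preposition (negative weight).
    return 1 if prev == "PREPOSITION" and cur == "PREPOSITION" else 0


# Hypothetical weights lambda_j paired with their feature functions f_j.
WEIGHTED_FEATURES = ((2.0, f1), (1.5, f2), (1.0, f3), (-3.0, f4))


def score(s, labels):
    """Sum the weighted features over all words of sentence s."""
    total = 0.0
    for i, cur in enumerate(labels):
        prev = labels[i - 1] if i > 0 else None  # no previous label at the start
        total += sum(w * f(s, i, cur, prev) for w, f in WEIGHTED_FEATURES)
    return total


def probability(s, labels):
    """p(l | s): exponentiate the score and normalize over all labelings."""
    z = sum(exp(score(s, cand)) for cand in product(LABELS, repeat=len(s)))
    return exp(score(s, tuple(labels))) / z


sentence = ["Run", "quickly"]
print(probability(sentence, ("VERB", "ADVERB")))
```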
The system extracts the user's relevant documents from the marketing business system, the 95598 customer service system of the State Grid headquarters, and other customer service business systems; uses the conditional random field model to analyze the text content of the responsibility work order; extracts the key business information; and automatically generates the work order, which is provided to power company business personnel as a basis for determining responsibility, thereby reducing their workload.
The image recognition technology in the system adopts a BRISK image recognition algorithm, and uses detection random response to establish a descriptor so as to generate a binary descriptor.
The BRISK image recognition algorithm is realized by the following steps:
(1) scale-space feature point detection: feature points of interest are identified across both the image and scale dimensions using a saliency criterion. To improve computational efficiency, feature points are detected in the layers of the image pyramid and in the layers in between. The position and scale of each key point are obtained in the continuous domain by fitting a quadratic function;
(2) key point detection: a sampling pattern of points lying on appropriately scaled concentric circles is applied to the neighborhood of each key point to retrieve gray values; the direction of the feature description is determined by processing the local intensity gradients;
the BRISK binary descriptor is a binary string formed by concatenating the results of simple brightness comparison tests. BRISK determines the characteristic direction of each feature point so as to obtain orientation-normalized descriptors;
① the key idea of the BRISK descriptor is a pattern used to sample pixels in the neighborhood of the key point. Taking the key point as the center, N sampling points are placed, equally spaced, on circles concentric with the key point.
② Gaussian smoothing is applied to each sampling point Pi in the pattern, with the standard deviation σi proportional to the distance between the sampling points on the respective circle; the pattern is positioned and scaled for a key point k in the image, and one of the N(N-1)/2 sampling point pairs, denoted (Pi, Pj), is considered.
③ the smoothed pixel values of these points are I(Pi, σi) and I(Pj, σj), respectively, and the local gradient g(Pi, Pj) is estimated by:
Figure GDA0003067546410000091
④ the set of all such combinations is called the set of sampling point pairs and is expressed as:
Figure GDA0003067546410000092
⑤ the subset S of short-distance point pairs and the subset L of long-distance point pairs are defined as follows:
Figure GDA0003067546410000093
Figure GDA0003067546410000094
⑥ the main direction of the feature point k is calculated as follows:
Figure GDA0003067546410000095
α=arctan2(gy,gx)
(3) the oriented BRISK sampling pattern is used to obtain pairwise brightness comparison results, which are assembled into the binary BRISK descriptor;
① the sampling pattern is rotated around the key point k by the angle α = arctan2(gy, gx);
② the BRISK descriptor is a 512-bit vector; each bit is generated by a brightness comparison on one short-distance point pair (Pαi, Pαj) ∈ S, where the superscript α denotes the rotated pattern. Each bit b corresponds to:
Figure GDA0003067546410000096
Figure GDA0003067546410000097
③ matching two BRISK descriptors is a simple computation of their Hamming distance, as in BRIEF: the number of bits that differ between the two descriptors measures their dissimilarity, and the operation reduces to a bitwise XOR followed by a bit count;
④ the comparison result is extracted, and after big data computation and comparison the image result is matched and associated with the responsibility work order in the system for business personnel to review.
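Steps (2) and (3) can be illustrated with a small pure-Python sketch. It is not the patent's implementation: it assumes the sampling points are given as (x, y) tuples, that `intensity` already holds the Gaussian-smoothed value at each point, that the pattern has already been rotated when the bits are computed, and that the thresholds `delta_min` and `delta_max` (names taken from the published BRISK formulation) are supplied by the caller. The descriptor is packed into a Python integer for brevity.

```python
from math import atan2, hypot


def local_gradient(p_i, p_j, intensity):
    """g(p_i, p_j) = (p_j - p_i) * (I(p_j) - I(p_i)) / ||p_j - p_i||^2."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    scale = (intensity[p_j] - intensity[p_i]) / (dx * dx + dy * dy)
    return dx * scale, dy * scale


def pairs(points):
    """All N(N-1)/2 sampling point pairs (p_i, p_j) with j < i."""
    return [(points[a], points[b]) for a in range(len(points)) for b in range(a)]


def distance(p_i, p_j):
    return hypot(p_j[0] - p_i[0], p_j[1] - p_i[1])


def main_direction(points, intensity, delta_min):
    """Average the gradients of the long-distance pairs; alpha = arctan2(gy, gx)."""
    long_pairs = [pq for pq in pairs(points) if distance(*pq) > delta_min]
    if not long_pairs:
        return 0.0  # no long-distance pairs: fall back to zero rotation
    gx = sum(local_gradient(p, q, intensity)[0] for p, q in long_pairs) / len(long_pairs)
    gy = sum(local_gradient(p, q, intensity)[1] for p, q in long_pairs) / len(long_pairs)
    return atan2(gy, gx)


def descriptor_bits(points, intensity, delta_max):
    """One bit per short-distance pair: 1 if I(p_j) > I(p_i), else 0."""
    bits = 0
    for p_i, p_j in pairs(points):
        if distance(p_i, p_j) < delta_max:
            bits = (bits << 1) | (1 if intensity[p_j] > intensity[p_i] else 0)
    return bits


def hamming(a, b):
    """Bitwise XOR followed by a bit count."""
    return bin(a ^ b).count("1")


# Hypothetical 4-point pattern; brightness grows with y in one patch, falls in the other.
POINTS = [(0, 0), (2, 0), (0, 2), (2, 2)]
RISING = {p: float(p[1]) for p in POINTS}
FALLING = {p: -float(p[1]) for p in POINTS}

print(main_direction(POINTS, RISING, delta_min=2.5))
print(hamming(descriptor_bits(POINTS, RISING, 2.5), descriptor_bits(POINTS, FALLING, 2.5)))
```

A real descriptor would use the full sampling pattern and 512 comparisons; libraries such as OpenCV provide a complete BRISK implementation.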
The image recognition technology extracts the geographical region where the video occurs and the common problems of the video from the video by analyzing the declared data on the user line and comparing the declared video image data, and associates the video data with the responsibility confirmation work order, deeply obtains the service problems of the typical responsibility confirmation work order according to the management requirements, and plays a positive promoting role in complaint confirmation management.
Data integration for automatic generation of responsibility work orders: the system combines technologies such as distributed real-time data messaging and Hadoop offline data extraction to achieve fast access to unstructured data, builds a distributed data integration function, supports scheduled and real-time data acquisition and processing, and supports configuration and development from the data source to the big data platform.
Data storage for automatic generation of responsibility work orders: to accommodate the diversity of data types, the storage layer uses storage technologies such as a relational database, a distributed file system and a distributed online database. It provides relational data storage, unstructured data storage and distributed file storage behind a unified storage access interface, improves the low-cost horizontal scalability of data storage, improves fast data access response under high concurrency, and meets real-time and near-real-time storage requirements for massive data.
Data computation for automatic generation of responsibility work orders: the platform layer parses and stores unstructured files using data processing technologies such as batch computing and stream computing, classifies and translates the data using Spark distributed technology together with an image recognition model and a decision tree analysis model, and places the translated data into a multi-threaded queue for parallel distributed computing tasks.
Data analysis for automatic generation of responsibility work orders: information is extracted from the data from multiple perspectives using the S-PLUS language. An ETL tool extracts data from distributed, heterogeneous data sources, such as relational data and flat data files, into a temporary staging layer, where the data is cleaned, transformed and integrated, and finally loaded into a data warehouse or data mart as the basis for online analytical processing and data mining, meeting the needs of business development.
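The extract, clean, transform and load sequence described above can be illustrated with a small sketch. It makes several assumptions that are not in the source: Python's standard `csv` and `sqlite3` modules stand in for the ETL tool and the data warehouse, and the column names, table name and sample rows are invented.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract; column names and rows are invented for illustration.
FLAT_FILE = """order_id,region,complaint
1001, Jinan ,Power outage not restored
1002,Qingdao,
1003,jinan,Meter reading disputed
"""


def extract(text):
    """Extract: read raw rows from a flat data file into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean and convert: drop rows without complaint text, normalize fields."""
    cleaned = []
    for row in rows:
        complaint = (row["complaint"] or "").strip()
        if not complaint:
            continue  # cleaning step: discard incomplete records
        cleaned.append((int(row["order_id"]), row["region"].strip().title(), complaint))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS complaints "
        "(order_id INTEGER PRIMARY KEY, region TEXT, complaint TEXT)"
    )
    conn.executemany("INSERT INTO complaints VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse or data mart
load(transform(extract(FLAT_FILE)), conn)
by_region = conn.execute(
    "SELECT region, COUNT(*) FROM complaints GROUP BY region"
).fetchall()
```

Keeping the three stages as separate functions mirrors the staging-layer idea: the raw rows are held unchanged until cleaning and conversion have finished, and only the cleaned rows reach the warehouse table.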
Finally, a model for automatically generating responsibility confirmation work orders is constructed using a decision tree algorithm.
Because a user's complaint work order contains Chinese text, video recordings, PDF images and the like, all of which are unstructured data, the text vocabulary, video recordings and PDF images must be converted into a form that a computer can recognize before they can be mined, so the unstructured data must be modeled. The present system models unstructured data using a decision tree model. The basic idea of the decision tree model is to divide the unstructured data into a number of feature items and to compute, by a specific method, the node for each feature item in the text. The tree starts as a single node representing the training samples. If the samples all belong to the same class, the node becomes a leaf. Otherwise, an algorithm selects the attribute with the greatest classification capability as the current node of the decision tree, so that the whole of the data is represented in structured form with the weights of the feature items as nodes and leaves. The training sample set is divided into subsets according to the values of the attribute at the current decision node; each value forms one branch, so several values form several branches. The previous step is repeated for each subset obtained, forming a decision tree on each partition of the samples. Once an attribute has appeared at a node, it need not be considered at any descendant of that node. The recursive partitioning stops only when one of the following conditions holds:
1. All samples at a given node belong to the same class.
2. No attributes remain on which the samples can be further partitioned; in this case majority voting converts the node into a leaf labeled with the class that has the most tuples among the samples, and the class distribution of the node's samples may also be stored.
3. A branch contains no samples; in this case a leaf is created with the majority class of the samples.
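The recursive partitioning and its three stopping conditions can be sketched as follows. This is a minimal illustration, not the patented implementation: the attribute-selection rule is passed in as a hypothetical `choose` function (the placeholder used here simply takes the last attribute rather than scoring attributes), and the attribute names, class labels and sample work orders are invented.

```python
from collections import Counter


def majority_class(samples):
    """Most frequent class label among (features, label) samples."""
    return Counter(label for _, label in samples).most_common(1)[0][0]


def build_tree(samples, attributes, domains, choose):
    """Recursively partition samples; return a class label (leaf) or a decision node.

    samples    -- list of (feature_dict, class_label) pairs
    attributes -- attribute names still available at this node
    domains    -- attribute name -> all values the attribute can take
    choose     -- function(samples, attributes) returning the attribute to test
    """
    labels = {label for _, label in samples}
    if len(labels) == 1:            # condition 1: all samples belong to one class
        return labels.pop()
    if not attributes:              # condition 2: no attributes left, majority vote
        return majority_class(samples)
    attr = choose(samples, attributes)
    remaining = [a for a in attributes if a != attr]  # not reused in descendants
    branches = {}
    for value in domains[attr]:     # one branch per attribute value
        subset = [s for s in samples if s[0][attr] == value]
        if not subset:              # condition 3: empty branch, majority class of node
            branches[value] = majority_class(samples)
        else:
            branches[value] = build_tree(subset, remaining, domains, choose)
    return {"attribute": attr, "branches": branches}


# Invented toy work orders; attributes and classes are placeholders.
DOMAINS = {"channel": ["phone", "online"], "topic": ["outage", "billing", "meter"]}
SAMPLES = [
    ({"channel": "phone", "topic": "outage"}, "grid"),
    ({"channel": "online", "topic": "outage"}, "grid"),
    ({"channel": "online", "topic": "outage"}, "grid"),
    ({"channel": "phone", "topic": "billing"}, "marketing"),
    ({"channel": "online", "topic": "billing"}, "marketing"),
]
# Placeholder selector: picks the last attribute instead of scoring attributes.
tree = build_tree(SAMPLES, ["channel", "topic"], DOMAINS, lambda s, attrs: attrs[-1])
```

In this toy run the `meter` branch receives no samples, so it becomes a leaf labeled with the majority class of the node, which is the third stopping condition.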
The decision tree model in the system is computed using a TF-IDF method. Suppose a node S contains s samples belonging to m classes Ci, i ∈ {1, 2, …, m}, and the number of samples in class Ci is si; the amount of information (entropy) required to determine the class of any sample in the node is:
Figure GDA0003067546410000111
where pi is the probability that a sample belongs to class Ci, so pi = si/s. Suppose feature A has v distinct values {a1, a2, ..., av}; feature A then divides the node's samples into v subsets {S1, S2, ..., Sv}, where Sj contains the samples in S for which A takes the value aj, and the corresponding number of samples is sj. Let Sij be the set of samples in subset Sj that belong to class Ci, with corresponding sample count sij; the information entropy required to determine the class of any sample in subset Sj is:
Figure GDA0003067546410000112
the information entropy required for class determination of samples in all subsets should be a weighted average of the information entropy required for class determination of samples in each subset alone:
Figure GDA0003067546410000113
in the above formula
Figure GDA0003067546410000114
is the weight of the jth subset.
The information gain obtained by using feature A to divide the samples of the current node into subsets is:
Gain(A)=I(s1,s2,...,sm)-E(A)
The feature with the largest Gain(A), that is, the smallest E(A), is selected as the test feature; dividing the node's samples into subsets by this feature minimizes the weighted average information entropy required to determine the class of any sample in the resulting subsets. Information entropy reflects the uncertainty of the samples: the larger the entropy, the higher the uncertainty, and the smaller the entropy, the lower the uncertainty. The test feature selected in this way therefore makes it easier to determine the sample classes in the subsets, which makes the overall decision tree simpler.
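The selection rule can be checked numerically with a short sketch. It assumes the usual base-2 entropy, I(s1, ..., sm) = -Σ pi·log2(pi), and the weighted average E(A) = Σ (sj/s)·I(s1j, ..., smj); the logarithm base, the function names and the toy data are assumptions for illustration, since the formulas themselves appear only as figures in the source.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """I(s1, ..., sm) = -sum(p_i * log2(p_i)), with p_i = s_i / s."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_entropy(values, labels):
    """E(A): entropy of each subset S_j weighted by its share s_j / s."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    return sum(len(sub) / total * entropy(sub) for sub in subsets.values())


def gain(values, labels):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return entropy(labels) - expected_entropy(values, labels)


# Invented toy node: four work orders and two candidate features.
labels = ["grid", "grid", "marketing", "marketing"]
topic = ["outage", "outage", "billing", "billing"]    # separates the classes perfectly
channel = ["phone", "online", "phone", "online"]      # carries no class information
gains = {"topic": gain(topic, labels), "channel": gain(channel, labels)}
best = max(gains, key=gains.get)
```

Here `topic` splits the node into two pure subsets, so E(A) is 0 and its gain equals the node entropy of 1 bit, while `channel` leaves both subsets evenly mixed and gains nothing; `best` is therefore `"topic"`.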
Using the decision tree algorithm, the system automatically extracts the relevant work order data according to the needs of the power business, generates responsibility confirmation work orders, and analyzes the content of the automatically generated work orders. It accurately locates the service work orders that users submit online, classifies the responsibility confirmation work orders by region, designates the responsible handlers, and dispatches the work orders to them for processing.
The system uses information from the power company's marketing business application system, such as user information, organizational structure information and complaints, together with video information collected by the 95598 video monitoring system and file information collected by a structured data platform. Using image recognition and file retrieval technology, it builds an image recognition model and a full-text retrieval model to analyze and form typical video and image cases. This supports scenarios such as supervision and management of user complaint events, implementation of complaint responsibility, and comprehensive analysis, and provides decision support for the company to improve its power supply service quality and service image.
Before this application was developed, the traditional complaint responsibility confirmation method required analyzing the content of the user's complaint, combining it with business data in the marketing system, and verifying it at the customer's site; traditional responsibility confirmation and image comparison also handled massive data inefficiently. To address these shortcomings, the system uses information technology, based on image recognition and decision tree models, to intelligently analyze customer complaint data, automatically match associated system data, and carry out complaint responsibility confirmation online. This shortens the complaint responsibility confirmation cycle and allows work orders to be processed in time. It resolves problems such as the heavy workload of confirming complaint work orders, the long communication cycle among confirmation and statistics personnel, the complexity of analyzing confirmation result reports, and non-standard confirmation processes. It effectively improves the efficiency of complaint responsibility confirmation, removes the limitations of traditional offline circulation, and saves the company's human resources. The system can automatically identify complaint data that requires responsibility confirmation, accurately locate the smallest department responsible for the complaint, and analyze and mine the confirmation result data, which helps standardize customer complaint investigation and handling and improves power supply service capability and service quality.

Claims (6)

1. A complaint responsibility confirmation system based on a text mining technology is characterized by comprising a data layer, a platform layer, a middleware, a service layer and a business layer;
the data layer comprises a relational database, a distributed file system and a non-relational database; the data layer periodically extracts data from the 95598 business support system and the marketing business application system of State Grid headquarters and stores the data by category in the relational database, the distributed file system and the non-relational database; the data is computed using image recognition technology and decision tree processing technology and modeled by means of data modeling and data mining, so as to analyze and mine the data and support the application of the responsibility confirmation function module in the system;
the platform layer comprises a public service component, a service flow engine component and a Spark in the marketing development platform;
the middleware comprises a Zookeeper, an image recognition algorithm plug-in and a decision tree algorithm plug-in;
the service layer comprises a uniform coding service, a uniform caching service, a bidding model calculation service, a flow service and a file storage service;
the business layer comprises responsibility identification, image identification and work order automatic generation;
the image recognition technology compares the declared video image data by analyzing the declared data on the user line, and associates the declared video image data with the related video data and the responsibility confirmation work order;
data integration: real-time data distributed information and a Hadoop offline data extraction technology are fused, rapid access of unstructured data is achieved, and a distributed data integration function is constructed;
data storage: providing relational data storage, unstructured data storage and distributed file storage, and providing a uniform storage access interface;
data computation: the platform layer parses and stores unstructured files using batch computing and stream computing, classifies and translates the data using Spark distributed technology together with an image recognition model and a decision tree analysis model, and places the translated data into a multi-threaded queue for parallel distributed computing tasks;
data analysis: information is extracted from the data from multiple perspectives using the S-PLUS language; an ETL tool extracts data from distributed, heterogeneous data sources such as relational data and flat data files into a temporary staging layer, where the data is cleaned, transformed and integrated and finally loaded into a data warehouse or data mart as the basis for online analytical processing and data mining;
and finally, a model for automatically generating responsibility confirmation work orders is constructed using a decision tree algorithm.
2. The system of claim 1, wherein the public service component includes unified coding, unified caching, unified validation rules, a plug-in communication mode and REST-based RPC.
3. The system of claim 1, wherein said image recognition technique uses a binary descriptor generated by a BRISK image recognition algorithm using a test random response to create a descriptor.
4. The system of claim 1, wherein the system employs a text-based processing method to process grammar, the text-based processing method using conditional random field models to extract keyword models, comprising the steps of:
(1) in the first step of modeling, the model basic data maintenance function needs to load some pre-training word embedding, and simultaneously, the meanings of some key words are extracted from characters;
(2) it is necessary to get a meaningful representation of each word in the context using LSTM;
(3) the keyword extraction model algorithm is formulated as a binary classification problem that determines whether each word or phrase in a document is a keyword; because this is a classification problem, labeled training corpora must be provided, the keyword extraction model is trained on the training corpora, and the trained model extracts keywords from the documents to be processed.
5. The system of claim 4, wherein the keyword extraction model algorithm is a linear-chain CRF, in which each feature depends only on the current label and the previous label rather than on arbitrary labels in the whole sentence; to model this behavior, a feature function is used that takes several inputs:
the sentence s;
the position i of a word in the sentence;
the label l_i of the current word;
the label l_{i-1} of the previous word;
next, each feature function f_j is assigned a weight λ_j; given a sentence s, a labeling l of s is scored according to the following formula, which sums the weighted features over all words in the sentence:
Figure FDA0003067546400000021
finally, these scores are converted into probabilities p(l | s) between 0 and 1 by exponentiating and normalizing:
Figure FDA0003067546400000022
6. The complaint responsibility confirmation system based on text mining technology as claimed in any one of claims 1 to 5, wherein the decision tree processing technology performs modeling using a decision tree model, and the decision tree model is computed using a TF-IDF method; suppose a node S contains s samples belonging to m classes Ci, i ∈ {1, 2, …, m}, and the number of samples in class Ci is si; the amount of information (entropy) required to determine the class of any sample in the node is:
Figure FDA0003067546400000031
where pi is the probability that a sample belongs to class Ci, so pi = si/s; suppose feature A has v distinct values {a1, a2, ..., av}; feature A then divides the node's samples into v subsets {S1, S2, ..., Sv}, where Sj contains the samples in S for which A takes the value aj, and the corresponding number of samples is sj; let Sij be the set of samples in subset Sj that belong to class Ci, with corresponding sample count sij; the information entropy required to determine the class of any sample in subset Sj is:
Figure FDA0003067546400000032
the entropy of information required for class determination of samples in all subsets should be a weighted average of the entropy of information required for class determination of samples in each subset alone:
Figure FDA0003067546400000033
in the above formula
Figure FDA0003067546400000034
is the weight of the jth subset;
the information gain obtained by using feature A to divide the samples of the current node into subsets is:
Gain(A)=I(s1,s2,...,sm)-E(A)
the feature with the largest Gain(A), that is, the smallest E(A), is selected as the test feature; dividing the node's samples into subsets by this feature minimizes the weighted average information entropy required to determine the class of any sample in the resulting subsets; information entropy reflects the uncertainty of the samples: the larger the entropy, the higher the uncertainty, and the smaller the entropy, the lower the uncertainty; the test feature selected in this way therefore makes it easier to determine the sample classes in the subsets, which makes the overall decision tree simpler.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811211967.9A CN109408574B (en) 2018-10-18 2018-10-18 Complaint responsibility confirmation system based on text mining technology

Publications (2)

Publication Number Publication Date
CN109408574A CN109408574A (en) 2019-03-01
CN109408574B true CN109408574B (en) 2021-08-31

Family

ID=65468489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811211967.9A Active CN109408574B (en) 2018-10-18 2018-10-18 Complaint responsibility confirmation system based on text mining technology

Country Status (1)

Country Link
CN (1) CN109408574B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20211028
Address after: No. 150, Jinger Road, Daguanyuan, Shizhong District, Jinan City, Shandong Province
Patentee after: Shandong Electric Power Marketing Center
Patentee after: ELECTRIC POWER RESEARCH INSTITUTE OF STATE GRID SHANDONG ELECTRIC POWER Co.
Patentee after: STATE GRID CORPORATION OF CHINA
Address before: No.2000 Wangyue Road, Shizhong District, Jinan City, Shandong Province
Patentee before: ELECTRIC POWER RESEARCH INSTITUTE OF STATE GRID SHANDONG ELECTRIC POWER Co.
Patentee before: STATE GRID CORPORATION OF CHINA