Disclosure of Invention
The invention aims to provide a complaint responsibility confirmation system based on text mining technology, which can automatically generate a responsibility confirmation work order.
In order to solve this technical problem, the system of the invention comprises a data layer, a platform layer, middleware, a service layer and a business layer;
the data layer comprises a relational database, a distributed file system and a non-relational database; the data layer periodically extracts data from the 95598 business support system and the marketing business application system of the State Grid headquarters and stores the data, classified by type, in the relational database, the distributed file system and the non-relational database; the data are processed by image recognition technology and decision tree processing technology and modeled by means of data modeling and data mining, realizing the analysis and mining of the data and supporting the responsibility determination function module of the system;
the platform layer comprises a public service component, a service flow engine component and a Spark in the marketing development platform;
the middleware comprises a Zookeeper, an image recognition algorithm plug-in and a decision tree algorithm plug-in;
the service layer comprises a unified coding service, a unified caching service, a bidding model calculation service, a flow service and a file storage service;
the business layer comprises responsibility identification, image recognition and automatic work order generation;
the image recognition technology analyzes the data declared by users online, compares the declared video and image data, and associates the declared video and image data with the related video data and the responsibility confirmation work order;
data integration: real-time distributed data messaging is fused with Hadoop offline data extraction technology, achieving rapid access to unstructured data and constructing a distributed data integration function;
data storage: relational data storage, unstructured data storage and distributed file storage are provided, along with a unified storage access interface;
data calculation: the platform layer analyzes and stores unstructured files using batch and stream computation, classifies and translates the data using Spark distributed technology combined with a picture recognition model and a decision tree analysis model, and places the translated data into a multi-thread queue for parallel distributed calculation tasks;
data analysis: information is extracted from the data in various respects using the S-PLUS language; an ETL tool extracts data from distributed, heterogeneous data sources such as relational data and flat data files into a temporary intermediate layer, where the data are cleaned, transformed and integrated, and finally loaded into a data warehouse or data mart to serve as the basis for online analytical processing and data mining;
finally, a responsibility confirmation automatic work order generation model is constructed through the decision tree algorithm.
The common service components comprise unified coding, unified caching, unified verification rules, a plug-in communication mode and REST-based RPC.
The image recognition technology adopts the BRISK image recognition algorithm, which builds descriptors from brightness comparison tests to generate a binary descriptor.
The BRISK image recognition algorithm is realized by the following steps:
(1) scale-space feature point detection: feature points of interest are identified in both the image and scale dimensions using a saliency criterion; to improve computational efficiency, feature points are detected in the layers of an image pyramid, and the position and scale of each keypoint are obtained by fitting a quadratic function in the continuous domain;
(2) keypoint description: the sampling pattern consists of points lying on appropriately scaled concentric circles and is used to sample the gray values in the neighborhood of each keypoint; the local intensity gradients determine the orientation of the feature descriptor;
the BRISK binary descriptors are composed of bit strings obtained from simple brightness comparison tests; BRISK determines the characteristic direction of each feature point so as to obtain orientation-normalized descriptors;
① the key of the BRISK descriptor is the sampling pattern used at the neighborhood of each keypoint: centered on the keypoint, N sampling points are collected on circles around it, defining a number of equally spaced local circular regions;
② Gaussian smoothing is applied to each sampling point Pi in the pattern, with the standard deviation σi proportional to the distance of each sampling point from its respective center; with the pattern located and scaled at a keypoint k in the image, one of the N(N−1)/2 sampling point pairs is considered, denoted (Pi, Pj);
③ the smoothed pixel values of these points are I(Pi, σi) and I(Pj, σj) respectively, and the formula used to estimate the local gradient value g(Pi, Pj) is:
g(Pi, Pj) = (Pj − Pi) · (I(Pj, σj) − I(Pi, σi)) / ‖Pj − Pi‖²
④ the set of all such sampling point pairs is expressed as:
A = {(Pi, Pj) ∈ R² × R² | i < N, j < i, i, j ∈ N}
⑤ the short-distance point pair subset S and the long-distance point pair subset L are defined as:
S = {(Pi, Pj) ∈ A | ‖Pj − Pi‖ < δmax} ⊆ A
L = {(Pi, Pj) ∈ A | ‖Pj − Pi‖ > δmin} ⊆ A
⑥ the main direction of the feature point k is computed from the overall gradient g = (gx, gy) = (1/|L|) · Σ(Pi,Pj)∈L g(Pi, Pj) as:
α=arctan2(gy,gx)
(3) descriptor construction: the BRISK sampling pattern is used to obtain pairwise brightness comparison results, which are combined into the binary BRISK descriptor;
① the sampling pattern is rotated around the keypoint k by the angle α = arctan2(gy, gx);
② the BRISK descriptor is a vector of 512 bits; each bit is generated by a brightness comparison of a short-distance point pair (Pi^α, Pj^α) ∈ S, where the superscript α denotes the rotated pattern, and each bit b corresponds to:
b = 1 if I(Pj^α, σj) > I(Pi^α, σi), otherwise b = 0;
③ matching two BRISK descriptors is, as in BRIEF, a simple computation of their Hamming distance: the number of differing bits measures the dissimilarity of the two descriptors, and the computation reduces to a bitwise XOR followed by a bit count;
④ the comparison results are extracted, and after big-data computation and comparison the image results are matched and associated with the responsibility work order in the system so that business personnel can review them.
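The Hamming-distance matching described in step ③ can be sketched as follows; this is an illustrative Python fragment, not code from the claimed system, and the function names are hypothetical:

```python
def hamming_distance(d1: int, d2: int) -> int:
    """Number of differing bits between two binary descriptors (XOR + popcount)."""
    return bin(d1 ^ d2).count("1")

def best_match(query: int, candidates: list) -> int:
    """Index of the candidate descriptor closest to the query."""
    return min(range(len(candidates)),
               key=lambda i: hamming_distance(query, candidates[i]))

# Two 8-bit toy descriptors differing in exactly 2 bits
a = 0b10110100
b = 0b10011100
assert hamming_distance(a, b) == 2
assert best_match(a, [0b00000000, b]) == 1
```

In a production matcher the descriptors are 512-bit strings and the bit count is done with hardware popcount instructions, but the logic is the same.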
The system processes grammar with a natural-language text processing method and uses a conditional random field model to build the keyword extraction model; the specific steps are as follows:
(1) in the first step of modeling, the basic model data maintenance function loads pre-trained word embeddings and, at the same time, extracts the meanings of key words from the characters;
(2) an LSTM is used to obtain a meaningful representation of each word in its context;
(3) the keyword extraction model algorithm is formulated as a binary classification problem: deciding whether a word or phrase in a document is a keyword. Since this is a classification problem, a labeled training corpus must be provided; the keyword extraction model is trained on the corpus, and the keywords of the documents to be processed are then extracted according to the model.
The algorithm of the keyword extraction model is a linear-chain CRF. In a linear-chain CRF, features depend only on the current label and the previous label, not on arbitrary labels elsewhere in the sentence. To model this behavior, feature functions are used, each taking several input values:
the sentence s;
the position i of the word in the sentence;
the label l_i of the current word;
the label l_{i−1} of the previous word.
Next, each feature function f_j is given a weight λ_j. Given a sentence s, the score of a labeling l of s is computed by summing the weighted features over all words in the sentence:
score(l | s) = Σ_j Σ_i λ_j f_j(s, i, l_i, l_{i−1})
Finally, these scores are converted into probabilities p(l | s) between 0 and 1 by exponentiating and normalizing:
p(l | s) = exp(score(l | s)) / Σ_{l′} exp(score(l′ | s))
The decision tree processing technology is modeled with a decision tree model, whose calculation method adopts the TF-IDF method. Assume a node contains a sample set S with s samples belonging to m classes Ci, i ∈ {1, 2, …, m}, with si samples in class Ci. The amount of information (entropy) required to determine the class of any sample in the node is:
I(s1, s2, …, sm) = −Σi pi log2(pi)
where pi is the probability that any sample belongs to class Ci, so pi = si/s. Assume the feature A has v distinct values {a1, a2, …, av}; the node samples can then be divided by feature A into v subsets {S1, S2, …, Sv}, where Sj contains the samples of S whose feature A takes the value aj, with sj samples. Let sij be the number of samples in subset Sj that belong to class Ci. The entropy required to determine any sample's class in subset Sj is:
I(s1j, s2j, …, smj) = −Σi pij log2(pij), where pij = sij/sj
The information entropy required to classify the samples across all subsets is the weighted average of the entropy required in each subset alone:
E(A) = Σj ((s1j + s2j + … + smj)/s) · I(s1j, s2j, …, smj)
In the above formula, (s1j + s2j + … + smj)/s is the weight of the jth subset.
The information gain obtained by using feature A to divide the samples of the current node into subsets is:
Gain(A)=I(s1,s2,...,sm)-E(A)
The feature with the maximum Gain(A), i.e. the minimum E(A), is selected; partitioning the node samples by the test feature selected in this way minimizes the weighted average information entropy required to determine the class of any sample in the resulting subsets. The information entropy reflects the uncertainty of the samples: the larger the entropy, the higher the uncertainty, and the smaller the entropy, the lower the uncertainty. In other words, the test features selected by this method make it easier to determine the sample classes in the subsets, and thereby keep the overall decision tree simpler.
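As an illustrative sketch (not part of the claimed system), the node entropy I(s1, …, sm) can be computed from the per-class sample counts as follows:

```python
import math

def entropy(class_counts):
    """I(s1,...,sm) = -sum(pi * log2(pi)) with pi = si / s."""
    s = sum(class_counts)
    return -sum((si / s) * math.log2(si / s) for si in class_counts if si > 0)

# A node with 8 samples split evenly between two classes needs 1 bit
assert abs(entropy([4, 4]) - 1.0) < 1e-9
# A pure node (all samples in one class) needs no further information
assert entropy([8]) == 0.0
```

Classes with zero samples are skipped, matching the convention 0·log2(0) = 0.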
With this structure, the system uses the user information, organization structure information and complaint information of the power company's marketing service application system, combined with the video information acquired by the 95598 video monitoring system and the file information acquired by the structured data platform. Using image recognition and file retrieval technology, it builds an image recognition model and a text full-text retrieval model to analyze and form typical video and image cases, realizing scenarios such as the supervision and management of user complaint events, complaint responsibility assignment processing and comprehensive analysis, and providing decision support for improving the company's power supply service quality and service image. With the image recognition and decision tree model technologies, the system intelligently analyzes customer complaint data and automatically matches the associated system data, so that complaint responsibility confirmation is carried out online. This shortens the complaint responsibility confirmation period, allows work orders to be processed in time, and solves problems such as the heavy workload of complaint work order confirmation, the long communication period for confirmation statistics personnel, the complexity of confirmation result report analysis, and the non-standard confirmation process. It effectively improves the efficiency of complaint responsibility confirmation, overcomes the limitations of the traditional offline workflow, and saves the company's human resources.
The system can automatically identify the complaint data that require responsibility confirmation, accurately locate the smallest department responsible for the complaint, and analyze and mine the confirmation result data, which helps standardize the investigation and handling of customer complaints and improves power supply service capacity and service quality.
Detailed Description
The complaint responsibility confirmation system based on text mining technology is developed on a J2EE enterprise-level application framework platform. The overall technical architecture adopts an SOA service-oriented management framework mode with a B/S system architecture; the layered design concept of service support realizes the effective integration of all application components and provides services to users via the WEB. As shown in Fig. 1, the complaint responsibility confirmation system based on text mining technology includes a data layer, a platform layer, middleware, a service layer and a business layer. The data layer comprises a relational database, a distributed file system and a non-relational database; the data sources of the system mainly comprise the 95598 business support system, user file data, organization structure data and other external data in the State Grid marketing business application system. The data layer periodically extracts data from the data sources, classifies and stores the data in the relational database, the distributed file system and the non-relational database, processes the data through image recognition technology and decision tree algorithm technology, and models it through data modeling, data mining and similar means to realize the analysis and mining of the data. The platform layer comprises the public business components, the business flow engine component and Spark (a big data cloud computing platform) in the marketing development platform. The common service components comprise unified coding, unified caching, unified verification rules, a plug-in communication mode and REST-based RPC. The middleware comprises Zookeeper, an image recognition algorithm plug-in and a decision tree algorithm plug-in.
The service layer comprises a unified coding service, a unified caching service, a bidding model calculation service, a flow service and a file storage service. The business layer comprises responsibility identification, image recognition and automatic work order generation.
The system processes grammar with a natural-language text processing method, extracts key information from the large text passages that users enter into the system using the keyword extraction model algorithm, and presents the key information in the text to business personnel for review, making their work easier. The goal of the natural-language text processing is to label each word in the text as one of four categories: power services, individuals, miscellaneous, and others, and then to find the most prominent power service and name in the text. Each word is labeled using keyword extraction classification; the keyword extraction model then filters out unwanted labels with a rule-based approach and determines the most prominent power service and name. The natural-language text processing method uses a conditional random field model to build the keyword extraction model; the specific steps are as follows:
(1) in the first step of modeling, the basic model data maintenance function loads pre-trained word embeddings and, at the same time, extracts the meanings of key words from the characters;
(2) an LSTM is used to obtain a meaningful representation of each word in its context;
(3) the keyword extraction model algorithm is formulated as a binary classification problem: deciding whether a word or phrase in a document is a keyword. Since this is a classification problem, a labeled training corpus must be provided; the keyword extraction model is trained on the corpus, and the keywords of the documents to be processed are then extracted according to the model.
In the keyword extraction model algorithm, the input data is sequence data, and prediction at any data point must take the preceding context into account. The algorithm of the keyword extraction model in the system is a linear-chain CRF; in a linear-chain CRF, features depend only on the current label and the previous label, not on arbitrary labels elsewhere in the sentence. To model this behavior, feature functions are used, each taking several input values:
the sentence s;
the position i of the word in the sentence;
the label l_i of the current word;
the label l_{i−1} of the previous word.
Next, each feature function f_j is given a weight λ_j. Given a sentence s, the score of a labeling l of s is computed by summing the weighted features over all words in the sentence:
score(l | s) = Σ_j Σ_i λ_j f_j(s, i, l_i, l_{i−1})
Examples of feature functions based on part-of-speech tagging:
If l_i = ADVERB and the ith word ends in "-ly", then f_1(s, i, l_i, l_{i−1}) = 1; otherwise it is 0. If the corresponding weight λ_1 is positive and large, this feature means that words ending in "-ly" tend to be labeled ADVERB.
If i = 1, l_i = VERB and the sentence ends with a question mark, then f_2(s, i, l_i, l_{i−1}) = 1; otherwise it is 0. If the corresponding weight λ_2 is positive and large, this feature indicates that the first word of a question tends to be labeled VERB (e.g. the "Is" in "Is this a sentence beginning with a verb?").
If l_{i−1} = ADJECTIVE and l_i = NOUN, then f_3(s, i, l_i, l_{i−1}) = 1; otherwise it is 0. A positive corresponding weight means that nouns tend to follow adjectives.
If l_{i−1} and l_i are both PREPOSITION, then f_4(s, i, l_i, l_{i−1}) = 1. The weight λ_4 for this function is negative, indicating that a preposition should not follow another preposition, so such labelings are avoided.
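The four feature functions above can be written out directly. The sketch below is illustrative Python only (it uses 0-based word positions rather than the 1-based positions in the text, and the sample sentence is hypothetical):

```python
def f_1(s, i, l_i, l_prev):
    """1 if the current label is ADVERB and the i-th word ends in '-ly'."""
    return 1 if l_i == "ADVERB" and s[i].endswith("ly") else 0

def f_2(s, i, l_i, l_prev):
    """1 if the first word of a question sentence is labeled VERB."""
    return 1 if i == 0 and l_i == "VERB" and s[-1].endswith("?") else 0

def f_3(s, i, l_i, l_prev):
    """1 if an ADJECTIVE is immediately followed by a NOUN."""
    return 1 if l_prev == "ADJECTIVE" and l_i == "NOUN" else 0

def f_4(s, i, l_i, l_prev):
    """1 if two PREPOSITION labels appear in a row (weighted negatively)."""
    return 1 if l_prev == "PREPOSITION" and l_i == "PREPOSITION" else 0

sentence = ["Is", "this", "clearly", "a", "question?"]
assert f_1(sentence, 2, "ADVERB", "OTHER") == 1
assert f_2(sentence, 0, "VERB", None) == 1
assert f_3(sentence, 3, "NOUN", "ADJECTIVE") == 1
```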
Finally, these scores are converted into probabilities p(l | s) between 0 and 1 by exponentiating and normalizing:
p(l | s) = exp(score(l | s)) / Σ_{l′} exp(score(l′ | s))
To establish the keyword extraction model algorithm, a set of feature functions is defined (which may depend on the whole sentence, the current position of a word, and the labels of nearby words), weights are assigned, the weighted features are summed, and, if necessary, the sums are converted into probability form.
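Putting the pieces together, the scoring and normalization steps can be sketched as follows. This is an illustrative Python fragment with a single hypothetical feature and a toy weight, not the trained model of the system; a real linear-chain CRF computes the normalizer with dynamic programming rather than by enumerating all labelings:

```python
import math
from itertools import product

def f_adverb(s, i, l_i, l_prev):
    """Hypothetical feature: current label ADVERB and word ends in '-ly'."""
    return 1 if l_i == "ADVERB" and s[i].endswith("ly") else 0

FEATURES = [(f_adverb, 2.0)]  # (feature function, weight lambda_j)
LABELS = ["ADVERB", "OTHER"]

def score(s, labeling):
    """score(l|s) = sum over features j and positions i of lambda_j * f_j."""
    return sum(w * f(s, i, labeling[i], labeling[i - 1] if i else None)
               for f, w in FEATURES for i in range(len(s)))

def label_probabilities(s):
    """Exponentiate and normalize the scores of every candidate labeling."""
    labelings = list(product(LABELS, repeat=len(s)))
    scores = [score(s, l) for l in labelings]
    z = sum(math.exp(x) for x in scores)  # normalizing constant
    return {l: math.exp(x) / z for l, x in zip(labelings, scores)}

probs = label_probabilities(["quickly"])
assert abs(sum(probs.values()) - 1.0) < 1e-9
assert probs[("ADVERB",)] > probs[("OTHER",)]
```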
The system extracts the relevant user documents from the marketing business system, the 95598 customer service system of the State Grid headquarters and other customer service business systems, uses the conditional random field model algorithm to analyze the text content of responsibility work orders, extracts key business information and automatically generates the work order, providing power company business personnel with a basis for judging where responsibility lies and thereby reducing their workload.
The image recognition technology in the system adopts the BRISK image recognition algorithm, which builds descriptors from brightness comparison tests to generate a binary descriptor.
The BRISK image recognition algorithm is realized by the following steps:
(1) scale-space feature point detection: feature points of interest are identified in both the image and scale dimensions using a saliency criterion. To improve computational efficiency, feature points are detected in the layers of the image pyramid. The position and scale of each keypoint are obtained by fitting a quadratic function in the continuous domain;
(2) keypoint description: the sampling pattern consists of points lying on appropriately scaled concentric circles and is used to sample the gray values in the neighborhood of each keypoint; the local intensity gradients determine the orientation of the feature descriptor;
BRISK binary descriptors are composed of bit strings obtained from simple brightness comparison tests. BRISK determines the characteristic direction of each feature point so as to obtain orientation-normalized descriptors;
① the key of the BRISK descriptor is the sampling pattern used at the neighborhood of each keypoint: centered on the keypoint, N sampling points are collected on circles around it, defining a number of equally spaced local circular regions.
② Gaussian smoothing is applied to each sampling point Pi in the pattern, with the standard deviation σi proportional to the distance of each sampling point from its respective center; with the pattern located and scaled at a keypoint k in the image, one of the N(N−1)/2 sampling point pairs is considered, denoted (Pi, Pj).
③ the smoothed pixel values of these points are I(Pi, σi) and I(Pj, σj) respectively, and the formula used to estimate the local gradient value g(Pi, Pj) is:
g(Pi, Pj) = (Pj − Pi) · (I(Pj, σj) − I(Pi, σi)) / ‖Pj − Pi‖²
④ the set of all such sampling point pairs is expressed as:
A = {(Pi, Pj) ∈ R² × R² | i < N, j < i, i, j ∈ N}
⑤ the short-distance point pair subset S and the long-distance point pair subset L are defined as:
S = {(Pi, Pj) ∈ A | ‖Pj − Pi‖ < δmax} ⊆ A
L = {(Pi, Pj) ∈ A | ‖Pj − Pi‖ > δmin} ⊆ A
⑥ the main direction of the feature point k is computed from the overall gradient g = (gx, gy) = (1/|L|) · Σ(Pi,Pj)∈L g(Pi, Pj) as:
α=arctan2(gy,gx)
(3) descriptor construction: the BRISK sampling pattern is used to obtain pairwise brightness comparison results, which are combined into the binary BRISK descriptor;
① the sampling pattern is rotated around the keypoint k by the angle α = arctan2(gy, gx);
② the BRISK descriptor is a vector of 512 bits; each bit is generated by a brightness comparison of a short-distance point pair (Pi^α, Pj^α) ∈ S, where the superscript α denotes the rotated pattern. Each bit b corresponds to:
b = 1 if I(Pj^α, σj) > I(Pi^α, σi), otherwise b = 0;
③ matching two BRISK descriptors is, as in BRIEF, a simple computation of their Hamming distance: the number of differing bits measures the dissimilarity of the two descriptors, and the computation reduces to a bitwise XOR followed by a bit count;
④ the comparison results are extracted, and after big-data computation and comparison the image results are matched and associated with the responsibility work order in the system so that business personnel can review them.
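The local gradient and orientation computation of steps ③ and ⑥ can be sketched as follows, under the assumption of the published BRISK formulas; the smoothed intensity values are supplied directly here, whereas a real pipeline obtains them from Gaussian-smoothed image samples:

```python
import math

def local_gradient(p_i, p_j, I_i, I_j):
    """g(Pi,Pj) = (Pj - Pi) * (I(Pj,sj) - I(Pi,si)) / ||Pj - Pi||^2"""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    norm_sq = dx * dx + dy * dy
    k = (I_j - I_i) / norm_sq
    return (dx * k, dy * k)

def orientation(long_pair_gradients):
    """alpha = arctan2(gy, gx) of the gradient averaged over long pairs."""
    n = len(long_pair_gradients)
    gx = sum(g[0] for g in long_pair_gradients) / n
    gy = sum(g[1] for g in long_pair_gradients) / n
    return math.atan2(gy, gx)

# A pair along the x-axis, 2 pixels apart, with intensities 10 and 30
g = local_gradient((0.0, 0.0), (2.0, 0.0), 10.0, 30.0)
assert g == (10.0, 0.0)
assert orientation([g]) == 0.0  # gradient points along the x-axis
```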
By analyzing the data declared by users online and comparing the declared video and image data, the image recognition technology extracts from the video the geographical region where it was recorded and the common problems it shows, associates the video data with the responsibility confirmation work order, and, according to management requirements, drills into the service problems behind typical responsibility confirmation work orders, playing a positive role in promoting complaint confirmation management.
Data integration for the automatically generated responsibility work orders: the system fuses technologies such as real-time distributed data messaging and Hadoop offline data extraction, realizes rapid access to unstructured data, constructs a distributed data integration function, provides scheduled and real-time data acquisition and processing capabilities, and enables configuration and development from the data sources to the big data platform.
Data storage for the automatically generated responsibility work orders: given the diversity of data types, the storage layer adopts storage technologies such as a relational database, a distributed file system and a distributed online database; it provides relational data storage, unstructured data storage and distributed file storage, exposes a unified storage access interface, improves the low-cost horizontal scalability of data storage and the fast data access response under high concurrency, and meets the real-time and near-real-time storage requirements of mass data.
Data calculation for the automatically generated responsibility work orders: the platform layer analyzes and stores unstructured files using data processing technologies such as batch and stream computation, classifies and translates the data using Spark distributed technology combined with a picture recognition model and a decision tree analysis model, and places the translated data into a multi-thread queue for parallel distributed computation tasks.
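The multi-thread queue for parallel computation can be illustrated with the following sketch; it is a simplified stand-in (plain Python threads instead of Spark), and the classification function and record names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def classify(record):
    """Stand-in for the picture-recognition / decision-tree classification."""
    return ("image" if record.endswith(".jpg") else "text", record)

# Translated records are placed on a queue...
work_queue = Queue()
for item in ["complaint_001.jpg", "complaint_002.txt"]:
    work_queue.put(item)

# ...and consumed by a pool of worker threads in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(classify, work_queue.get())
               for _ in range(work_queue.qsize())]
    results = [f.result() for f in futures]

assert ("image", "complaint_001.jpg") in results
assert ("text", "complaint_002.txt") in results
```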
Data analysis for the automatically generated responsibility work orders: information is extracted from the data in various respects using the S-PLUS language; an ETL tool extracts data from distributed, heterogeneous data sources, such as relational data and flat data files, into a temporary intermediate layer, where it is cleaned, transformed and integrated, and finally loaded into a data warehouse or data mart to serve as the basis for online analytical processing and data mining, meeting the requirements of business development.
Finally, a responsibility confirmation automatic work order generation model is constructed through the decision tree algorithm.
Since a user's complaint work order information includes Chinese text, video recordings, PDF pictures and the like, it is unstructured data; the text vocabulary, video recordings and PDF pictures must be converted into a form the computer can recognize before mining, so the unstructured data must be modeled. The present system models the unstructured data with a decision tree model. The basic idea of the decision tree model is to divide the unstructured data into a number of feature items and compute the weight of each feature item in the text by a specific means; the weights of the feature items then serve as the nodes and leaves that represent the whole of the structured data. Construction starts from a single node representing the training samples: if the samples all belong to the same class, the node becomes a leaf; otherwise the algorithm selects the attribute with the greatest classification ability as the current node of the decision tree, and the training sample set is divided into several subsets according to the values of the current decision attribute, each value forming one branch. The previous step is repeated on each subset so obtained, forming the decision tree over each partition of the samples. Once an attribute appears at a node, it need not be considered at any descendant of that node. The recursive partitioning stops only when one of the following conditions holds:
1. All samples of a given node belong to the same class.
2. No attributes remain for further partitioning the samples; in this case majority voting is used to convert the given node into a leaf, the class with the most samples becomes the class label, and the class distribution of the node's samples can also be stored.
3. A branch contains no samples of the existing classes; in this case a leaf is created with the majority class of the samples.
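The recursive partitioning with these stopping conditions can be sketched as follows; this is an illustrative Python fragment with hypothetical feature names, and it splits on attributes in a fixed order rather than by maximum information gain:

```python
from collections import Counter

def build_tree(samples, attributes):
    """samples: list of (feature_dict, label); attributes: names left to split on."""
    labels = [lbl for _, lbl in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:   # condition 1: all samples in one class -> leaf
        return labels[0]
    if not attributes:          # condition 2: no attributes left -> majority vote
        return majority
    attr = attributes[0]        # (a full ID3 would pick the attribute with max Gain)
    branches = {}
    for value in {f[attr] for f, _ in samples}:
        subset = [(f, l) for f, l in samples if f[attr] == value]
        branches[value] = build_tree(subset, attributes[1:])
    # majority kept for branches with no samples of existing classes (condition 3)
    return (attr, branches, majority)

data = [({"length": "long"}, "complaint"), ({"length": "short"}, "inquiry")]
tree = build_tree(data, ["length"])
assert tree[1]["long"] == "complaint"
assert tree[1]["short"] == "inquiry"
```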
The calculation method of the decision tree model in the system adopts the TF-IDF method. Assume a node contains a sample set S with s samples belonging to m classes Ci, i ∈ {1, 2, …, m}, with si samples in class Ci. The amount of information (entropy) required to determine the class of any sample in the node is:
I(s1, s2, …, sm) = −Σi pi log2(pi)
where pi is the probability that any sample belongs to class Ci, so pi = si/s. Assume the feature A has v distinct values {a1, a2, …, av}; the node samples can then be divided by feature A into v subsets {S1, S2, …, Sv}, where Sj contains the samples of S whose feature A takes the value aj, with sj samples. Let sij be the number of samples in subset Sj that belong to class Ci. The entropy required to determine any sample's class in subset Sj is:
I(s1j, s2j, …, smj) = −Σi pij log2(pij), where pij = sij/sj
The information entropy required to classify the samples across all subsets is the weighted average of the entropy required in each subset alone:
E(A) = Σj ((s1j + s2j + … + smj)/s) · I(s1j, s2j, …, smj)
In the above formula, (s1j + s2j + … + smj)/s is the weight of the jth subset.
The information gain obtained by using feature A to divide the samples of the current node into subsets is:
Gain(A)=I(s1,s2,...,sm)-E(A)
The feature with the maximum Gain(A), i.e. the minimum E(A), is selected; partitioning the node samples by the test feature selected in this way minimizes the weighted average information entropy required to determine the class of any sample in the resulting subsets. The information entropy reflects the uncertainty of the samples: the larger the entropy, the higher the uncertainty, and the smaller the entropy, the lower the uncertainty. In other words, the test features selected by this method make it easier to determine the sample classes in the subsets, and thereby keep the overall decision tree simpler.
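The Gain(A) computation described above can be illustrated with a short sketch (illustrative Python, not system code); per-class counts for the node and for each subset induced by a feature are all that is needed:

```python
import math

def entropy(counts):
    """I(s1,...,sm) = -sum(pi * log2(pi)) with pi = si / s."""
    s = sum(counts)
    return -sum((c / s) * math.log2(c / s) for c in counts if c > 0)

def information_gain(total_counts, subset_counts):
    """Gain(A) = I(s1,...,sm) - E(A), with E(A) weighted by subset size."""
    s = sum(total_counts)
    e_a = sum((sum(sub) / s) * entropy(sub) for sub in subset_counts)
    return entropy(total_counts) - e_a

# A perfectly separating feature: 4+4 samples split into two pure subsets
gain = information_gain([4, 4], [[4, 0], [0, 4]])
assert abs(gain - 1.0) < 1e-9
```

The attribute with the largest gain (equivalently, the smallest E(A)) is chosen as the current decision node.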
Through the decision tree algorithm, and according to the needs of the power business, the relevant work order data are automatically extracted, the responsibility confirmation work order is generated, and the content of the automatically generated responsibility confirmation work order is analyzed; the service work orders declared online by users are accurately located, the responsibility confirmation work orders are classified by region, responsibility handlers are designated, and the work orders are dispatched to them for processing.
The system uses the user information, organization structure information and complaint information of the power company's marketing service application system, combined with the video information acquired by the 95598 video monitoring system and the file information acquired by the structured data platform. Using image recognition and file retrieval technology, it builds an image recognition model and a text full-text retrieval model to analyze and form typical video and image cases, realizing scenarios such as the supervision and management of user complaint events, complaint responsibility assignment processing and comprehensive analysis, and providing decision support for the company to improve power supply service quality and service image.
Before this application, the traditional complaint responsibility confirmation method required analyzing the content of user complaints and verifying it at the customer site against the business data in the marketing system, and was inefficient at responsibility confirmation and image comparison over mass data. Through the informatization means of this system, with the image recognition and decision tree model technologies, customer complaint data are analyzed intelligently and the associated system data are matched automatically, so that complaint responsibility confirmation is carried out online. This shortens the complaint responsibility confirmation period, allows work orders to be processed in time, and solves problems such as the heavy workload of complaint work order confirmation, the long communication period for confirmation statistics personnel, the complexity of confirmation result report analysis and the non-standard confirmation process; it effectively improves the efficiency of complaint responsibility confirmation, overcomes the limitations of the traditional offline workflow, and saves the company's human resources. The system can automatically identify the complaint data that require responsibility confirmation, accurately locate the smallest department responsible for the complaint, and analyze and mine the confirmation result data, which helps standardize the investigation and handling of customer complaints and improves power supply service capacity and service quality.