Summary of the invention
The present invention addresses the low analysis efficiency of the prior art by providing a power outage cause identification system based on text analysis.
To solve the above technical problem, the present invention adopts the following technical solution:
A power outage cause identification system based on text analysis includes a database and a processor. The database stores power outage data generated when customer service staff record outage complaints while providing customer service. The processor is provided with a text splitting and filtering expert system module, a root cause identification expert system module, and an HDSP identification module;
The text splitting and filtering expert system module splits and filters the power outage data so that each piece of power outage data after splitting and filtering has one and only one outage cause. The module includes a text splitting unit and a filtering expert system unit. The text splitting unit splits the power outage data successively by comma, period, and semicolon; the filtering expert system unit filters the split power outage data and removes data unrelated to the outage cause;
The root cause identification expert system module extracts common rules from the split and filtered power outage data, analyzes the power outage data using the common rules, and derives an identification text;
The HDSP identification module performs a secondary analysis on the power outage data that remain unidentified after analysis by the text splitting and filtering expert system module and the root cause identification expert system module, and obtains an identification text.
Preferably, the root cause identification expert system module further includes a rule learning unit, a rule base unit, and a fact base unit;
The rule learning unit extracts common rules from the split and filtered power outage data and compares the performance parameter of each common rule with a first threshold preset in the rule base unit. When the identification accuracy given by the performance parameter of the common rule is higher than the accuracy of the first threshold, the performance parameter of the common rule is compared with a second threshold in the fact base unit. If the accuracy of the performance parameter of the common rule is higher than the accuracy of the second threshold, the common rule is updated into the rule base; otherwise, the common rule continues to be optimized;
The rule base unit contains matching words for identifying different outage causes; the common rule is matched against the matching words to derive the identification text corresponding to the power outage data;
The fact base includes industry background knowledge, initial text data, later labeled data, and the recognition performance data produced while the root cause identification expert system module is running.
Preferably, the root cause identification expert system module further includes an inference engine, a human-machine interaction unit, and an explanation unit. The inference engine infers the logical relations among the rules in the rule base unit. The human-machine interaction unit includes a human-machine interface, through which engineers improve the data in the rule base unit and the fact base unit and carry out new rule learning. The explanation unit presents the outage cause identification result directly to the user on the human-machine interface.
Preferably, the HDSP identification module extracts the unidentified power outage data and generates a training text. Performance parameters are obtained by analyzing the training text, and these performance parameters are used to generate the identification text and to identify the outage cause of the remaining unidentified power outage data.
Preferably, θ and p(θ) are obtained from the training text, where θ is the topic vector, each column of which represents the probability that a topic occurs in the document, and p(θ) is the Dirichlet distribution of the topic vector θ. Two control parameters α and β are then derived: α is the parameter of the distribution p(θ) and is used to generate a topic vector θ; β is the word probability distribution matrix p(w | z) corresponding to each topic. The control parameters α and β determine the topic model. The model generates the identification text as follows: (1) select a topic vector θ and determine the probability with which each topic is selected; (2) select a topic z from the topic distribution vector θ and generate a word from the word probability distribution of topic z; this word is the identification text.
Preferably, the HDSP identification module extracts the unidentified power outage data and generates a test text. The outage causes in the test text are identified manually in order to judge whether the control parameters α and β obtained from the training text are reasonable, and the parameters are adjusted accordingly.
By adopting the above technical solution, the present invention achieves significant technical effects. Existing machine learning classification algorithms cannot identify multiple outage causes in a single text content. In view of this, the present patent first processes the text data with the text splitting and filtering expert system, and then uses the root cause identification expert system together with the HDSP identification model to identify the root cause of the text content, so that multiple outage causes can be identified for one text content. The text splitting and filtering expert system narrows the identification range of the root cause identification expert system, which makes rules easier to establish and substantially improves the recognition performance of the root cause identification expert system. After splitting and filtering, each text content contains only one outage cause, so machine learning classification algorithms can be applied effectively. Because establishing rules in an expert system is an iterative process, identifying root causes with the root cause identification expert system alone may leave some text data without an identified outage cause. For this reason, the HDSP identification model is further used to identify the text data that remain unidentified, which substantially reduces the amount of unidentified text data and compensates for the shortcomings of the root cause identification expert system in identification. The present patent can help customers pinpoint outage causes in disorderly work orders and clarify the attribution of responsibility, creating the conditions for improving service quality, strengthening the management of the customer service center, and raising user satisfaction. It also helps the enterprise handle outage events promptly and establish a good corporate image. The model and system run automatically, and their evaluation criteria are objective and uniform, which greatly reduces the workload of staff and resolves the inconsistency of results caused by subjective judgments of staff.
Embodiment 1
A power outage cause identification system based on text analysis includes a database and a processor. The database stores power outage data generated when customer service staff record outage complaints while providing customer service. The processor is provided with a text splitting and filtering expert system module, a root cause identification expert system module, and an HDSP identification module;
The text splitting and filtering expert system module splits and filters the power outage data so that each piece of power outage data after splitting and filtering has one and only one outage cause. The module includes a text splitting unit and a filtering expert system unit. The text splitting unit splits the power outage data successively by comma, period, and semicolon; the filtering expert system unit filters the split power outage data and removes data unrelated to the outage cause;
The root cause identification expert system module extracts common rules from the split and filtered power outage data, analyzes the power outage data using the common rules, and derives an identification text;
The HDSP identification module performs a secondary analysis on the power outage data that remain unidentified after analysis by the text splitting and filtering expert system module and the root cause identification expert system module, and obtains an identification text.
The splitting rule is to split first by comma, then split the results by period, and finally split again by semicolon. The purpose of text filtering is to remove the irrelevant fragments left after this splitting. The main rules are as follows: 1. fragments shorter than 6 characters are filtered out; 2. fragments that contain only time descriptions such as year, month, and day are filtered out; 3. a fragment that contains a word on the blacklist is filtered out; 4. a fragment that contains a word on the whitelist must not be filtered out.
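The splitting and filtering rules above can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the word lists, the function names, and the regular expression for time-only fragments are hypothetical, and the sketch assumes that the whitelist rule takes precedence over the other three rules, which the text does not state explicitly.

```python
import re

# Hypothetical word lists; the source does not disclose the actual lists.
BLACKLIST = {"satisfied", "thank you"}
WHITELIST = {"trip", "outage"}

# Split by comma, then period, then semicolon (full-width and ASCII forms).
SPLIT_STEPS = ("[,，]", "[.。]", "[;；]")

# Assumed pattern for fragments made up only of digits and date/time marks.
TIME_ONLY = re.compile(r"^[\d\s:：\-/年月日时分秒]+$")


def split_text(text):
    """Apply the three splitting passes in order and drop empty pieces."""
    pieces = [text]
    for pattern in SPLIT_STEPS:
        pieces = [part for piece in pieces for part in re.split(pattern, piece)]
    return [p.strip() for p in pieces if p.strip()]


def keep(piece):
    """Return True if the fragment survives the four filtering rules."""
    if any(word in piece for word in WHITELIST):  # rule 4 (assumed to win)
        return True
    if len(piece) < 6:                            # rule 1: too short
        return False
    if TIME_ONLY.match(piece):                    # rule 2: time description only
        return False
    if any(word in piece for word in BLACKLIST):  # rule 3: blacklisted word
        return False
    return True


def split_and_filter(text):
    """Split a work-order text and keep only the relevant fragments."""
    return [p for p in split_text(text) if keep(p)]


sample = ("2015-01-27 08:20; low-voltage breaker trip, ok, "
          "customer satisfied. line relocation work")
print(split_and_filter(sample))
# ['low-voltage breaker trip', 'line relocation work']
```

Splitting on the ASCII period would also break decimal numbers; a production rule set would need to handle such cases.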
As shown in Fig. 2, the root cause identification expert system is built mainly as a rule-based expert system (RBES), with the rule base and the fact base as its core. Through human-machine interaction with users, domain experts, and engineers, rules are iterated during the rule acquisition stage in a cycle of create, test, refine, test, refine, ..., update. The inference engine makes explicit the logical relations among the rules in the rule base, and the explanation module annotates each result output during identification with the corresponding matched rule, so that users can manually judge the rule matching result.
The root cause identification expert system module further includes a rule learning unit, a rule base unit, and a fact base unit, as shown in Fig. 3;
The rule learning unit extracts common rules from the split and filtered power outage data and compares the performance parameter of each common rule with a first threshold preset in the rule base unit. When the identification accuracy given by the performance parameter of the common rule is higher than the accuracy of the first threshold, the performance parameter of the common rule is compared with a second threshold in the fact base unit. If the accuracy of the performance parameter of the common rule is higher than the accuracy of the second threshold, the common rule is updated into the rule base; otherwise, the common rule continues to be optimized until it meets the update condition.
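The two-threshold update condition described above can be sketched as a simple gate. The function name and the numeric thresholds are hypothetical; the source does not specify how the accuracy of a rule is measured or what the threshold values are.

```python
def should_update_rule_base(accuracy, first_threshold, second_threshold):
    """Return True when a common rule may be written into the rule base.

    The rule must exceed the first threshold (held in the rule base unit)
    and then the second threshold (held in the fact base unit). A rule
    that fails either comparison goes back for further optimization.
    """
    if accuracy <= first_threshold:
        return False
    return accuracy > second_threshold


# Hypothetical values for illustration only.
print(should_update_rule_base(0.90, first_threshold=0.80, second_threshold=0.85))
print(should_update_rule_base(0.82, first_threshold=0.80, second_threshold=0.85))
```

With these illustrative values, the first rule would be added to the rule base and the second would be optimized further.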
The rule base unit contains matching words for identifying different outage causes; the common rule is matched against the matching words to derive the identification text corresponding to the power outage data;
The fact base includes industry background knowledge, initial text data, later labeled data, and the recognition performance data produced while the root cause identification expert system module is running.
The industry background knowledge includes:
1. Analysis of general power grid accident types in recent years
1.1. By cause: in terms of why power grid accidents occur, the main causes of general power grid accidents are relay protection, severe weather, external damage, misoperation, poor quality, personnel responsibility, and other causes.
1.2. By responsibility: power grid accidents can be divided by responsibility into natural disasters, manufacturing quality, external damage, operating personnel, design, personnel responsibility, and others. According to statistics, natural disasters (lightning strikes, fog flashover, icing and conductor galloping), personnel responsibility (operating personnel and other personnel), external damage, and manufacturing quality are, in that order, the main responsibility causes of general power grid accidents.
1.3. By technical category: power grid accidents can be divided by technical category into relay protection, lightning strikes, ground short circuits, serious misoperation, accidental contact and maloperation, equipment faults, and others. Of these, ground short circuits (external damage, discharge to ground), relay protection (protection maloperation, protection failure to operate, secondary circuit faults, etc.), and lightning strikes are the main technical causes of general power grid accidents.
1.4. By equipment category: power grid accidents can generally be divided by equipment category into transmission lines, relay protection, other electrical equipment, switches, disconnectors, combined electrical apparatus, etc. Practice shows that transmission lines and relay protection are, in that order, the main equipment causes of power grid accidents.
For example, one piece of initial text data:
On February 6, 2015, after verification by Hu Weihua, leader of the outside-line team of the Baisha power supply station of the Yangxin County power supply company, there were 3 outages in total during the period reported by the customer, with the following specific causes: 1. Bai 16 branch line, Victory Star Avenue transformer district, relocation at Mayuan on the Qingshui line, outage and restoration time: 2015-01-13 08:20-16:25; 2. Bai 16 branch line, Victory Star Avenue 2# transformer district, connection of a newly added distribution transformer at Mayuan, Lianggongpu, outage and restoration time: 2015-01-27 08:20-18:05; 3. related fault ticket: 2015020542186467, outage cause: the low-voltage main air circuit breaker of the transformer district tripped, outage and restoration time: February 5, 2015 20:07-21:08; the cause of the outage has been resolved by emergency repair and reset, the outage cause has been reported back to the customer (15272057988), and the customer accepts the explanation.
Later labeled data: the labels corresponding to the above example are planned outage, planned outage, fault outage. Recognition performance data include: the labels identified by the model for the above example are planned outage, planned outage, fault outage. In this case the labels identified for the text are all correct.
The rule base contains a large number of qualifying rules, whose format mainly comprises the following two parts:
1) Determining the constraint symbols between matching words:
Rule matching compares the matching words in the rule knowledge base against the corresponding text content. During matching, cases such as contains, does not contain, contains all, and contains only one can arise, so the relation between the matching words and the text content must be stated during matching. When determining the constraint symbols between matching words, the constraint symbols shown in Table 1 were therefore established on the basis of the cases that can arise during matching.
Table 1. Description of constraint symbols between matching words
Note: each constraint symbol can be followed by only one matching word. To connect several matching words, a rule is built as the combination "constraint symbol A + word + space + constraint symbol B + word".
2) Determining the mutual exclusion symbol between categories:
Because rules are established separately for each rule category, the rules of two categories can be mutually exclusive. For a given text content (where categories A and B are mutually exclusive), if the text is judged to belong to category A it cannot be judged to belong to category B. The rule symbol defined for this case is shown in Table 2.
Table 2. Description of the mutual exclusion symbol
The root cause identification expert system module further includes an inference engine, a human-machine interaction unit, and an explanation unit. The inference engine infers the logical relations among the rules in the rule base unit. The human-machine interaction unit includes a human-machine interface, through which engineers improve the data in the rule base unit and the fact base unit and carry out new rule learning. The explanation unit presents the outage cause identification result directly to the user on the human-machine interface.
HDSP is mainly a topic model based on the LDA algorithm, further combined with a supervised classification learning algorithm, so that the algorithm can learn labels autonomously while it extracts topics.
The traditional and simplest way to judge whether two documents are similar is to count the words the two documents share, as in TF-IDF. This approach does not take into account the semantics carried by the words, so it misjudges two documents that are semantically similar but share few words. Judging document similarity therefore also requires considering the semantics of the documents themselves, and semantic mining mainly relies on topic models. In a topic model, a topic can be a concept or an aspect, and it can also be a set of related words together with the conditional probabilities of those words. In short, a topic contains many words that are strongly related to it, that is, words with a high probability of appearing in documents on that topic.
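The weakness of counting shared words can be seen in a small sketch. The function below counts raw shared words only (it is not TF-IDF weighting), and the example sentences are invented for illustration.

```python
def shared_word_count(doc_a, doc_b):
    """Count the distinct whitespace-separated words that both documents contain."""
    return len(set(doc_a.split()) & set(doc_b.split()))


# Similar meaning, but almost no words in common.
print(shared_word_count("heavy rain caused the outage",
                        "storm cut the power"))            # 1
# Different meaning, but more words in common.
print(shared_word_count("heavy rain caused the outage",
                        "heavy rain delayed the train"))   # 3
```

The first pair describes the same kind of event yet shares only one word, while the second pair shares three words despite describing different events, which is the misjudgment a topic model is meant to avoid.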
Based on the LDA algorithm, the HDSP identification module extracts the unidentified power outage data and divides them into a training text and a test text. Performance parameters are obtained by training on the training text, and the test text is then used to test them and arrive at the performance parameters with the higher recognition accuracy. These performance parameters are used to generate the identification text and to identify the outage cause of the remaining unidentified power outage data.
Two control parameters α and β are obtained by training with the training algorithm (α is the parameter of the distribution p(θ) and is used to generate a topic vector θ; β is the word probability distribution matrix p(w | z) corresponding to each topic). The control parameters α and β determine the topic model, which generates the identification text. The algorithm by which the model generates the identification text is as follows:
Choose parameter θ ~ p(θ);
For each of the N words w_n:
    Choose a topic z_n ~ p(z | θ);
    Choose a word w_n ~ p(w | z);
Where:
θ: topic vector, each column of which represents the probability that a topic occurs in the document
p(θ): the Dirichlet distribution of θ
N: the number of words in the document to be generated
w_n: the n-th generated word w
z_n: the topic selected for the n-th word
p(z | θ): the probability distribution of topic z given θ
p(w | z): the distribution of word w given topic z
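The generation steps above can be sketched with the Python standard library. This is an illustrative sketch of the standard LDA generative process, not the HDSP implementation; the function names, the vocabulary, and the values of α and β are hypothetical. The Dirichlet draw uses the usual construction from normalized gamma variates.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocabulary, n_words, seed=0):
    """Generate n_words words following the steps listed above.

    alpha: Dirichlet parameter, one positive value per topic.
    beta:  beta[z][i] is p(w = vocabulary[i] | z) for topic z.
    """
    rng = random.Random(seed)
    theta = sample_dirichlet(alpha, rng)          # choose theta ~ p(theta)
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]      # z_n ~ p(z | theta)
        w = rng.choices(vocabulary, weights=beta[z])[0]   # w_n ~ p(w | z)
        words.append(w)
    return theta, words


# Hypothetical two-topic example: weather-related and equipment-related causes.
vocabulary = ["rainstorm", "lightning", "leakage", "appliance"]
beta = [
    [0.5, 0.5, 0.0, 0.0],   # topic 0: natural disaster
    [0.0, 0.0, 0.5, 0.5],   # topic 1: user equipment fault
]
theta, words = generate_document([1.0, 1.0], beta, vocabulary, n_words=5)
print(theta, words)
```

Fixing the seed makes the draw repeatable; the printed values depend on the random number generator and are therefore not listed here.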
The problem a topic model must solve is how to generate topics. To this end, the topic model uses a generative model to connect documents and topics. The generative model assumes that each word of every document is obtained by the process "select a topic with a certain probability, then select a word from that topic with a certain probability". For a document, the probability that each word it contains occurs is therefore:
p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})
This probability formula can be expressed in matrix form:
Formula one: [\text{document-word}] = [\text{document-topic}] \times [\text{topic-word}]
Here the "document-word" matrix represents the word frequency of each word in each document, that is, its probability of occurrence; the "topic-word" matrix represents the probability that each word occurs in each topic; and the "document-topic" matrix represents the probability that each topic occurs in each document. Given a set of documents, the "document-word" matrix on the left side is obtained by segmenting the documents into words and computing the frequency of each word in each document. The terms in Formula one are explained as follows. Word: the set made up of a series of words that occur in the documents, such as rainstorm, heavy rain, magpie, bird nest, household appliance, and electric leakage; these words are obtained after word segmentation of the text. Document: the content of frequent-outage data, such as text stating that strong wind and heavy rain caused an outage or that leakage from a user's household appliance caused an outage. Topic: a cause of frequent outages, such as natural disaster, human external damage, user equipment fault, or bird damage; each of these is a topic under frequent outage causes. Topic vector: the set made up of the topics mentioned above.
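The relation among the three matrices can be checked numerically with a small sketch. The function name and the matrix values below are invented for illustration; rows of the "document-topic" matrix are documents, and rows of the "topic-word" matrix are topics.

```python
def document_word_matrix(doc_topic, topic_word):
    """Multiply "document-topic" by "topic-word" to get "document-word".

    Entry [d][w] is the sum over topics k of
    p(topic k | document d) * p(word w | topic k).
    """
    n_topics = len(topic_word)
    n_words = len(topic_word[0])
    return [
        [sum(row[k] * topic_word[k][w] for k in range(n_topics))
         for w in range(n_words)]
        for row in doc_topic
    ]


# Hypothetical values: 2 documents, 2 topics, 3 words.
doc_topic = [[0.5, 0.5],
             [1.0, 0.0]]
topic_word = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]]
print(document_word_matrix(doc_topic, topic_word))
# [[0.25, 0.5, 0.25], [0.5, 0.5, 0.0]]
```

Each row of the result sums to 1, as expected for a probability distribution over words in a document.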
The method first selects a topic vector θ. Taking frequent outage causes as an example, the corresponding topic types are natural disaster, planned outage, bird damage, human external damage, and so on. These topic types are collected into a set, which is the topic vector; the elements of the topic vector are the topic types just described. The probability with which each topic is selected is then determined. Next, when each word is generated, a topic z is selected from the topic distribution vector θ, and a word is generated from the word probability distribution of topic z. From the figure, the joint probability is:
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
Mapping the above formula onto the figure, the topic model can be roughly interpreted as shown in Fig. 4. Its three representation layers are described in Table 3:
Table 3. Description of the diagram parameters
From the above discussion, the topic model mainly learns the two control parameters α and β from a given input corpus. Once these two control parameters have been learned, the model is determined and can be used to generate documents.
The HDSP identification module creates an improved model based on the topic model, which mainly covers two aspects: supervised classification and unsupervised topic clustering. The following three steps make up the supervised classification process of this model:
1) A certain amount of power outage data is obtained by sampling as sample data. The sample data are manually labeled with outage cause labels, and the labeled data are divided into a training part and a test part;
2) The HDSP model is trained with the labeled training samples. The trained HDSP model is then used to identify the outage causes of the test samples, and the outage causes identified by the model are output. As shown in Fig. 5, the training method is:
1. Segment the document content into words and compute the probability with which each word occurs in the document; combined with Formula one, this yields the "document-word" matrix.
2. Initialize the parameters α and β, the "document-topic" matrix, and the "topic-word" matrix.
3. Use β and the "document-topic" matrix to compute the "topic-word" matrix of the documents.
4. Use α and the "topic-word" matrix to compute the "document-topic" matrix.
5. Use the "topic-word" matrix resulting from step 3 to update the parameter β.
6. Use the "document-topic" matrix resulting from step 4 to update the parameter α.
7. Repeat steps 3 to 6 until convergence, at which point training ends.
3) Based on the test results, the recognition results output by the trained HDSP model for the test text are compared with the manually labeled results of the test samples, and the accuracy with which the HDSP model identifies outage causes is calculated. A recognition accuracy threshold is set in advance as the performance standard. Comparing the accuracy of the test results with the threshold shows whether the current model reaches the performance standard. If it does not, the model parameters are adjusted and step 2) is repeated; once the test results reach the performance standard, the trained model file is saved.
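The evaluation loop in step 3) can be sketched as follows. The candidate models stand in for successive rounds of parameter adjustment and retraining; they, the function names, the labels, and the threshold are hypothetical, and the sketch assumes that accuracy equal to the threshold counts as reaching the performance standard.

```python
def accuracy(predicted, expected):
    """Share of test samples whose predicted label equals the manual label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def first_model_meeting_standard(candidates, test_texts, test_labels, threshold):
    """Return (model, accuracy) for the first candidate reaching the threshold.

    Each candidate is a callable mapping a text to a label. Returns
    (None, None) if no candidate reaches the performance standard.
    """
    for model in candidates:
        predicted = [model(text) for text in test_texts]
        acc = accuracy(predicted, test_labels)
        if acc >= threshold:
            return model, acc
    return None, None


# Hypothetical stand-ins for a model before and after parameter adjustment.
def model_v1(text):
    return "planned outage"

def model_v2(text):
    return "fault outage" if "trip" in text else "planned outage"

texts = ["line relocation", "breaker trip", "transformer connection", "main switch trip"]
labels = ["planned outage", "fault outage", "planned outage", "fault outage"]
model, acc = first_model_meeting_standard([model_v1, model_v2], texts, labels, 0.9)
print(model.__name__, acc)   # model_v2 1.0
```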
In addition, the HDSP identification module can perform unsupervised clustering of topics. During clustering, the algorithm generates a cluster label for each instance according to the configured number of topic words and number of topics. The algorithm also supports manual intervention in the generated cluster labels. After manual intervention, the algorithm automatically learns the label knowledge introduced by the intervention, retrains the model, and clusters the text again. As the iterations proceed, the clustering precision of the algorithm increases.
In summary, the foregoing is only a preferred embodiment of the present invention. All equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope covered by this patent.