CN107992597B - Text structuring method for power grid fault case - Google Patents


Info

Publication number: CN107992597B
Authority: CN (China)
Prior art keywords: text, state quantity, entity, attributes, fault
Legal status: Active
Application number: CN201711325919.8A
Other languages: Chinese (zh)
Other versions: CN107992597A
Inventors: 杨祎, 马艳, 白德盟, 胡博, 闫丹凤, 郭诗瑶, 辜超, 郭志红, 陈玉峰, 李贞, 朱振华, 林颖, 李程启, 秦佳峰, 郑文杰
Assignees: State Grid Corp of China SGCC; Beijing University of Posts and Telecommunications; Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Application filed by State Grid Corp of China SGCC, Beijing University of Posts and Telecommunications, and Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority: CN201711325919.8A
Publication of application CN107992597A
Application granted
Publication of grant CN107992597B

Classifications

    • G06F 16/367: Information retrieval of unstructured textual data; creation of semantic tools; ontology
    • G06F 16/258: Integrating or interfacing systems involving database management systems; data format conversion from or to a database
    • G06F 16/374: Creation of semantic tools; thesaurus
    • G06F 40/295: Natural language analysis; recognition of textual entities; named entity recognition
    • G06Q 10/20: Administration; administration of product repair or maintenance
    • G06Q 50/06: ICT specially adapted for specific business sectors; energy or water supply


Abstract

The invention discloses a text structuring method for power grid fault cases. Named entity recognition is performed on the unstructured text, and a power-grid-domain entity dictionary is constructed to assist entity recognition and text word segmentation. Attribute values and the state quantities describing the attributes are extracted; the state quantities are divided by type into numeric and non-numeric state quantities, and the attributes modified by the numeric state quantities are extracted and matched with a rule-based method. The non-numeric state quantities are further divided into phrase-form and sentence-form state quantities, and the attributes they modify are extracted separately. Finally, a set of duplets formed by the attributes and their corresponding state quantities is generated from the recognized attributes and state quantities, completing the text structuring.

Description

Text structuring method for power grid fault case
Technical Field
The invention relates to a text structuring method for a power grid fault case.
Background
In the overhaul and maintenance of power systems, grid enterprises have accumulated a large number of fault case reports. These texts include overhaul and test records, inspection and defect-elimination records, fault problem descriptions, fault cause descriptions, and the like. They are mainly presented in unstructured form: the descriptive information is organized by the conventions and logic of natural language and has no predefined data model or character template.
For the descriptions of various equipment states scattered across different parts of a text, the purpose of structuring the text information is to obtain, by analyzing and processing the unstructured text, the state quantities that describe equipment faults, and to fill them into a predefined data model. Existing text structuring methods usually extract structured information only for general attributes, such as time, place, people, and specific relationships.
Such general methods have low usability in the power grid domain, because the attributes in power fault case texts have domain characteristics. The state quantities describing the attributes fall into two types: numeric ones, such as voltage level and fault time, and non-numeric ones described in words, such as fault phenomena and causes. Different types of state quantities therefore require different analysis methods, and current text structuring methods clearly do not perform such differentiated analysis.
Disclosure of Invention
In order to solve these problems, the invention provides a text structuring method for power grid fault cases that can specifically and comprehensively describe each attribute state quantity involved in each power fault case text.
In order to achieve the purpose, the invention adopts the following technical scheme:
a text structuring method for a power grid fault case comprises the following steps:
(1) performing named entity recognition on the unstructured text, and constructing a power-grid-domain entity dictionary to assist entity recognition and text word segmentation;
(2) extracting attribute values and the state quantities describing the attributes, the state quantities being divided by type into numeric state quantities and non-numeric state quantities, and extracting and matching the attributes modified by the numeric state quantities with a rule-based method;
(3) refining the non-numeric state quantities into phrase-form state quantities and sentence-form state quantities, and extracting the attributes they modify separately;
(4) finally generating, from the recognized attributes and their corresponding state quantities, a set of duplets formed by the attributes and the corresponding state quantities, completing the text structuring.
Further, in the step (1), in the corpus labeling stage, a dictionary matching method is adopted for automatic labeling, and the entity dictionary is continuously improved by a CRF++-based semi-supervised named entity recognition method.
Further, the method specifically comprises the following steps:
(1-1) constructing an initial seed entity dictionary;
(1-2) constructing a training set for training CRF++: automatic labeling is performed by a complete-matching method, in which an entity word from the entity dictionary is labeled as a named entity whenever it appears in the fault case text; after the automatic labeling of named entities is completed, the result is converted into the CRF++ training file format to facilitate subsequent model training;
(1-3) training on the constructed corpus with the CRF++ tool, predicting on the test corpus after model training is finished, and finding new named entities;
(1-4) screening the recognized new entities, and adding the screened entities to the entity dictionary for entity expansion;
(1-5) repeating steps (1-1) to (1-4) until no new valid entities can be found.
Further, in the step (1), the description attributes include faulty device, fault phenomenon, fault cause, fault time and/or voltage level.
Further, in the step (2), the attributes described by numeric state quantities in the text are extracted with a rule-based method.
Furthermore, mathematical units are recognized with a manually constructed mathematical-unit mapping dictionary; units in character form are extracted only when they are completely consistent with the units in the dictionary, while for mathematical symbols the case difference is ignored. After a mathematical unit is extracted, numbers are recognized with regular expressions, and rules are adopted to match the numbers with the mathematical units.
Further, in the step (3), the non-numeric state quantities refer to textual state quantities.
Further, in the step (3), the state quantities in phrase form are essentially named entities; the named entities in each case text sentence are recognized with a conditional random field model, and the words they modify are extracted with the Stanford Parser tool.
Furthermore, the matching conditions are constrained by the formulated grammar-modification rules: a named entity must find the nearest modified noun along the syntax tree, and that noun and the named entity form a duplet.
Further, in the step (3), during structuring, the fault description, fault analysis process, analysis conclusion, fault handling method, field condition, suggestion and countermeasure state quantities are described by sentences in the fault case text; the problem of extracting these descriptions is converted into a sentence classification problem, and a bidirectional LSTM classification model is constructed to complete the matching of sentence state quantities with the description attributes.
Compared with the prior art, the invention has the beneficial effects that:
1. The method can specifically and comprehensively describe each attribute state quantity involved in each power fault case text.
2. Experiments prove that the method is effective for the text structuring of power grid fault cases and better matches the domain characteristics of the attributes in power fault case texts.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a text structuring flow chart;
FIG. 2 is a chain structure diagram of a conditional random field;
FIG. 3 is a flow chart of the CRF++-based semi-supervised algorithm;
FIG. 4 is a flow chart of numeric state quantity description extraction;
FIG. 5 is a flow chart of non-numeric state quantity description extraction;
FIG. 6 is a diagram of an unrolled RNN;
FIG. 7 shows that the repeating module in a standard RNN contains a single layer;
FIG. 8 shows that the repeating module in an LSTM contains four interacting layers;
FIG. 9 is a structural view of Bi-LSTM;
FIG. 10 is a diagram of a Bi-LSTM fault case text sentence classification model;
FIG. 11 is a graph of the Loss variation on the training set and validation set for BLSTM_drop and BLSTM_nodrop as the number of training epochs increases;
FIG. 12 is a graph of the Loss variation on the training set and validation set for BLSTM_drop and BLSTM_nodrop as the number of training epochs increases;
FIG. 13 is a graph of the F1_score variation on the training set and validation set for BiLSTM and LSTM as the number of training epochs increases.
the specific implementation mode is as follows:
the invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
In the present invention, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side" and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are relational terms used only for convenience in describing the structural relationships of the parts or elements of the invention; they do not refer to any specific part or element and are not to be construed as limiting the invention.
In the present invention, terms such as "fixedly connected" and "connected" are to be understood in a broad sense, meaning either a fixed connection, an integral connection or a detachable connection, and either a direct connection or an indirect connection through an intermediate. The specific meanings of the above terms can be determined by persons skilled in the relevant field according to the specific situation and are not to be construed as limiting the invention.
A method for structuring power fault case texts describes, specifically and comprehensively, each attribute state quantity involved in each power fault case text.
In order to achieve the above object, an embodiment of the present invention provides a method for text structuring of a power failure case, including:
In the training corpus labeling stage, considering the large workload of manual labeling, a dictionary matching method is adopted for automatic labeling, and the entity dictionary is continuously improved by a CRF++-based semi-supervised named entity recognition method.
Attributes described by numeric state quantities in the text, for example time, number, temperature and various electrical indicators, are extracted with a rule-based method. Mathematical units are recognized with a manually constructed mathematical-unit mapping dictionary: units in character form must be completely consistent with the units in the dictionary to be extracted, while for mathematical symbols the case difference is ignored. After the mathematical units are extracted, numbers are recognized with regular expressions and rules are employed to match the numbers with the units. To obtain the modification relationship between a state quantity and an attribute, the text sentence is syntactically parsed, and the state quantity is matched with the modified attribute based on the parsing result, yielding the final (attribute, state quantity) duplet.
Non-numeric state quantities mainly refer to textual state quantities, which take two expression forms in power grid fault descriptions: one based on phrases and the other based on sentences.
A phrase-form state quantity is essentially a named entity, such as a device name or a manufacturer name. The named entities in each case text sentence can be recognized with a conditional random field model, and the words they modify can be extracted with the Stanford Parser tool. A grammar modification rule is formulated to constrain the matching conditions: a named entity must find the nearest modified noun along the syntax tree, and that noun forms a duplet with the named entity.
During structuring, the descriptions of seven state quantities in the fault case text, namely fault description, fault analysis process, analysis conclusion, fault handling method, field condition, suggestion, and countermeasure, are sentences, so their extraction is converted into a 7-class sentence classification problem. A bidirectional LSTM classification model is constructed to complete the matching of sentence state quantities with the description attributes.
The embodiment of the invention provides a method for structuring power fault case texts that first analyzes the different types of state quantities in the texts as numeric, phrase and sentence types respectively, and then formulates a different structuring method for each type, so that a specific and comprehensive structuring result of the power fault case text can be obtained.
In power grid fault description texts, the fault description information is presented in unstructured form: it is organized by the conventions and logic of natural language, with no predefined data model or character template. For the descriptions of various equipment states scattered across different parts of a text, the purpose of structuring the text information is to obtain, by analyzing and processing the unstructured text, the state quantities that describe equipment faults, and to fill them into a predefined data model. The data model R is described as a set of duplets:
R={(x,y)|x∈A,y∈S}
where A represents the predefined description attributes and S represents the states of the attributes to be mined and analyzed from the unstructured text. In this work, the description attributes include the faulty device, fault phenomenon, fault cause, fault time, voltage class, and so on. The state quantities describing the attributes fall into two types: numeric ones, such as voltage level and fault time, and non-numeric ones described in words, such as fault phenomena and causes. Different methods therefore need to be used to analyze the different types of state quantities. The overall flow of the text structuring process is shown in fig. 1.
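As a minimal illustration, the data model R can be represented directly as (attribute, state quantity) pairs; the attribute set below is an assumption drawn from the attributes listed above, not a schema fixed by the patent:

```python
from typing import List, Tuple

# Predefined description attribute set A (illustrative subset).
ATTRIBUTES = {"faulty device", "fault phenomenon", "fault cause",
              "fault time", "voltage class"}

def make_duplet(attribute: str, state_quantity: str) -> Tuple[str, str]:
    """Build one (attribute, state quantity) duplet of the model R."""
    if attribute not in ATTRIBUTES:
        raise ValueError("attribute must come from the predefined set A")
    return (attribute, state_quantity)

# The structured output of one case text is a list of such duplets.
structured_case: List[Tuple[str, str]] = [
    make_duplet("voltage class", "220kV"),
    make_duplet("fault time", "2016年5月12日"),  # illustrative values
]
```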
The method of the embodiment of the invention comprises the following steps:
A1. Perform named entity recognition on the unstructured text, and construct a power-grid-domain entity dictionary to assist entity recognition and text word segmentation. Specifically, the embodiment of the invention provides a CRF++-based semi-supervised iteration method that continuously optimizes the entity recognition effect and expands the dictionary.
A2. Extract the attribute values and the state quantities describing the attributes. The state quantities are divided by type into numeric and non-numeric state quantities; for the numeric state quantities, the embodiment provides a rule-based method to extract and match the attributes they modify.
A3. For the non-numeric state quantities, the embodiment divides them into phrase-form and sentence-form state quantities. For phrase-form state quantities, the embodiment extracts the modified attribute with a method based on grammar modification rules; for sentence-form state quantities, it classifies the sentences with a method combining word embedding and a bidirectional LSTM (long short-term memory) network, thereby determining the modified attribute.
A4. From the attributes and corresponding state quantities recognized by the above methods, finally generate a set of (attribute, state quantity) duplets, completing the text structuring.
The method of the present embodiment will be described in detail below.
1. Named entity recognition
Named entity recognition refers to recognizing entities with specific meanings in text, commonly including names of people, places, organizations and the like. It is a key technology in natural language processing and the basis of text information understanding and processing, with an important position in research branches such as information extraction, information filtering, question-answering systems and machine translation. In different application scenarios, named entities can be given different meanings; in this embodiment, named entities refer to the basic attributes of a device fault that are not described by numeric state quantities, such as the manufacturer, the fault description and the device type.
The main approaches to named entity recognition fall into two categories: rule-based methods and machine-learning-based methods. In rule-based methods, the formulation of rules depends heavily on domain experts; the rule design process is time-consuming and error-prone, and it is difficult to cover all language phenomena. Because named entities are numerous, they cannot be fully enumerated in a dictionary, and certain types of entity names change frequently without following strict rules or naming specifications, so new named entities are constantly produced over time. Rule-based methods therefore have poor accuracy, coverage, portability and extendability, and are not suitable for this embodiment.
By contrast, machine learning methods train and learn on manually labeled corpora; broad linguistic knowledge is not needed when labeling, most knowledge is acquired by the machine, and the objectivity is high. Once an effective language pattern has been learned, it can be generalized to all corpora, so new named entities can be recognized effectively and the degree of intelligence is high. This embodiment therefore adopts a machine learning method for named entity recognition, and the named entity recognition task can be viewed as a sequence labeling problem.
In current natural language processing research, the conditional random field (CRF) is the most widely applied sequence labeling model with good performance. A conditional random field is a probabilistic structured model for labeling and segmenting sequential data that provides an exponential model of the whole label sequence given the observation sequence. Compared with the hidden Markov model and the maximum entropy Markov model, it can depend on information over longer distances, makes good use of the context information within sentences, and can judge the words to be recognized more accurately from the related information. The conditional random field handles the label bias problem well when facing Chinese chunking and does not require strict independence assumptions. By defining multiple feature templates, the recognition accuracy is greatly improved. Under the same feature set, the conditional random field performs better than other probabilistic models, because it has strong inference capability and can be trained with overlapping, non-independent and very complex features; it can effectively use various internal and external information, and the feature set of the model is very rich. The basic principle of conditional random fields is described below.
1) Conditional random field
The conditional random field is an undirected-graph discriminative model: for a label sequence Y to be estimated and an observation sequence X, it solves the conditional probability P(Y|X), unlike generative models, which solve for the joint probability distribution P(X, Y). A conditional random field can also be viewed as a Markov random field.
The conditional random field is defined as follows. Let G = (V, E) be an undirected graph, where V is the set of nodes and E the set of undirected edges, and let $Y = \{Y_v \mid v \in V\}$, so that each node in V corresponds to a random variable $Y_v$ whose range is the label set. Conditioned on the observation sequence X, every random variable $Y_v$ is required to satisfy the Markov property

$$P(Y_v \mid X, Y_\omega, \omega \neq v) = P(Y_v \mid X, Y_\omega, \omega \sim v)$$

where $\omega \sim v$ indicates that the two nodes are adjacent in graph G. Then (X, Y) is a conditional random field model. Modeling sequences yields the simplest and most common chain structure, in which the nodes correspond to the elements of the label sequence Y; the chain structure is shown in fig. 2.
Obviously, there is no graph structure between the elements of the observation sequence X, since here the observation sequence X is taken as a condition and no independence assumption is made.
Given an observation sequence X, the probability of a particular label sequence Y can be defined as

$$P(Y \mid X) \propto \exp\Big(\sum_{i,j} \lambda_j\, t_j(y_{i-1}, y_i, X, i) + \sum_{i,k} \mu_k\, s_k(y_i, X, i)\Big)$$

where $t_j(y_{i-1}, y_i, X, i)$ is a transition function representing the probability of transition between the labels at positions $i-1$ and $i$ for the observation sequence X; $s_k(y_i, X, i)$ is a state function representing the probability of the label at position $i$ for the observation sequence X; and $\lambda_j$ and $\mu_k$ are the weights of $t_j$ and $s_k$, which need to be estimated from the training samples.
When defining the feature functions, a set of binary features $b(X, i) \in \{0, 1\}$ of the observation sequence can be defined to represent certain distributional characteristics of the training samples:

$$b(X, i) = \begin{cases} 1, & \text{if the observation at position } i \text{ satisfies the given condition} \\ 0, & \text{otherwise.} \end{cases}$$
The transition function can then be defined as

$$t_j(y_{i-1}, y_i, X, i) = \begin{cases} b(X, i), & \text{if } y_{i-1} \text{ and } y_i \text{ take the given pair of values} \\ 0, & \text{otherwise.} \end{cases}$$
for ease of description, the state function may be written as follows:
s(yi,X,i)=s(yi-1,yi,X,i)
Thus the feature functions can be expressed uniformly as

$$F_j(Y, X) = \sum_{i} f_j(y_{i-1}, y_i, X, i)$$

where each local feature function $f_j(y_{i-1}, y_i, X, i)$ denotes either a state feature $s(y_{i-1}, y_i, X, i)$ or a transition function $t(y_{i-1}, y_i, X, i)$.
Thus, the conditional probability defined by a conditional random field can be given by

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big(\sum_j \lambda_j F_j(Y, X)\Big)$$

where the denominator $Z(X)$ is a normalization factor:

$$Z(X) = \sum_{Y} \exp\Big(\sum_j \lambda_j F_j(Y, X)\Big)$$
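To make the formulas concrete, the following toy computes $P(Y \mid X)$ for a tiny linear-chain CRF by brute force; the two indicator features and their weights are invented for illustration, and the partition function $Z(X)$ is enumerated exactly, which is only feasible for very short sequences:

```python
import math
from itertools import product

LABELS = ["B", "I", "O"]

def local_features(y_prev, y, x, i):
    # Two illustrative indicator features: one transition feature t and
    # one state feature s (position 0 is skipped for simplicity).
    return [1.0 if (y_prev, y) == ("B", "I") else 0.0,   # t(y_{i-1}, y_i, X, i)
            1.0 if y == "B" and x[i] == "江" else 0.0]   # s(y_i, X, i)

WEIGHTS = [1.2, 0.8]  # lambda_j / mu_k; normally estimated from training data

def score(ys, xs):
    return sum(w * f
               for i in range(1, len(xs))
               for w, f in zip(WEIGHTS, local_features(ys[i - 1], ys[i], xs, i)))

def probability(ys, xs):
    z = sum(math.exp(score(list(cand), xs))               # Z(X): brute-force sum
            for cand in product(LABELS, repeat=len(xs)))  # over label sequences
    return math.exp(score(ys, xs)) / z

print(probability(["O", "B", "I"], list("由江西")))
```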
2) Parameter estimation and optimization
The conditional random field model parameters can be solved by maximum likelihood estimation. Assume the training data are a set of samples $D = \{(x_j, y_j)\},\ j = 1, 2, \ldots, N$, where the samples are independent of each other, and let $\tilde p(x, y)$ be the empirical probability of $(x, y)$ in the training samples. For a conditional model $p(y \mid x, \theta)$, the maximum likelihood function over the training data set D can be expressed as

$$L(\theta) = \prod_{x, y} p(y \mid x, \theta)^{\tilde p(x, y)}$$
or, in logarithmic form,

$$\log L(\theta) = \sum_{x, y} \tilde p(x, y) \log p(y \mid x, \theta)$$
With specific reference to conditional random fields, the conditional probability can be expressed as

$$p(y \mid x, \theta) = \frac{1}{Z(x)} \exp\Big(\sum_k \lambda_k f_k(y, x) + \sum_k \mu_k g_k(y, x)\Big)$$

where $\theta = \{\lambda_1, \lambda_2, \ldots, \lambda_n, \mu_1, \mu_2, \ldots, \mu_n\}$ are the weight parameters to be estimated.
The likelihood function of the CRF model can then be expressed as

$$L(\theta) = \sum_{x, y} \tilde p(x, y)\big(\lambda \cdot f(y, x) + \mu \cdot g(y, x)\big) - \sum_{x} \tilde p(x) \log Z(x)$$

where $\lambda$ and $\mu$ are the parameters to be estimated and $f$ and $g$ are the vectors of feature functions.
Taking the derivative with respect to the parameter $\lambda_k$:

$$\frac{\partial L}{\partial \lambda_k} = \sum_{x, y} \tilde p(x, y)\, f_k(y, x) - \sum_{x} \tilde p(x) \sum_{y} p(y \mid x, \theta)\, f_k(y, x)$$

Setting this expression to zero to satisfy the constraint conditions yields $\lambda_k$; $\mu_k$ can be obtained in the same way, thereby solving for $\theta$.
The conditional random field model is one of the most popular natural language models at present; it overcomes the label bias problem of the maximum entropy Markov model and has the advantage of capturing long-distance dependencies.
3) CRF++
The present embodiment uses a conditional random field model for named entity recognition and selects CRF++ as the CRF implementation tool, because it is one of the most popular CRF tools today.
CRF++ is a simple, customizable, open-source CRF tool that can be used for labeling sequential data, and its toolkit includes an executable for the Windows system.
CRF++ takes three input files, the training corpus, the template file and the test file, and outputs the final result to a designated file.
a) Training file, test file and result file formats
The training file consists of multiple sentences (which can be understood as multiple training examples), separated by line breaks at the end of each sentence. Each sentence can carry several columns of tags; the last column is the label, and the first column is the character sequence of the sentence. Table 1 is an example of entity labeling for equipment manufacturers: it contains two columns of data, the first being the known corpus and the second the entity label. The labels fall into three categories, with "B" and "I" respectively indicating the beginning and continuation of a named entity, and "O" indicating unrelated character sequences. In the example, the entity "西安电缆厂" (Xi'an Cable Plant) is labeled "B" at its first character and "I" at each following character.
TABLE 1 Equipment manufacturer entity labeling corpus example
(The table is rendered as an image in the original document.)
The test file format is the same as the training file format. After the test file is run through named entity recognition by the CRF++ tool, the result is output to a designated file. The difference between the result file and the test file is that the result file has one more column, the predicted label; by comparing the predicted labels with the annotated labels, the accuracy of named entity recognition can finally be calculated. Table 2 shows an example result: CRF++ identifies the unlabeled equipment manufacturer "江西变压器厂" (Jiangxi Transformer Factory), so CRF++ has a certain ability to recognize new named entities.
Table 2 Equipment manufacturer entity recognition result example

Character | Annotated label | Predicted label
主 | O | O
变 | O | O
由 | O | O
江 | O | B
西 | O | I
变 | O | I
压 | O | I
器 | O | I
厂 | O | I
生 | O | O
产 | O | O

(The sentence reads "主变由江西变压器厂生产", i.e., the main transformer was produced by Jiangxi Transformer Factory.)
b) Template file format
In order to fully account for linguistic phenomena and produce a model that reflects the intrinsic rules of the language, features can be screened through templates. A template considers the specific positions and specific information of the context. The greatest advantage of the CRF model is that it can use various kinds of context information and external features. The feature template provides a uniform pattern for generating feature functions; with it, all the required feature functions can be obtained conveniently. The formulation of the feature templates directly determines the generated feature functions and provides a flexible means of feature selection. The feature templates used in this embodiment are shown in Table 3.
TABLE 3 named entity recognition feature template
Feature template type | Feature template
Unigram | U00:%x[-2,0]
Unigram | U01:%x[-1,0]
Unigram | U02:%x[0,0]
Unigram | U03:%x[1,0]
Unigram | U04:%x[2,0]
Unigram | U05:%x[-2,0]/%x[-1,0]/%x[0,0]
Unigram | U06:%x[-1,0]/%x[0,0]/%x[1,0]
Unigram | U07:%x[0,0]/%x[1,0]/%x[2,0]
Unigram | U08:%x[-1,0]/%x[0,0]
Unigram | U09:%x[0,0]/%x[1,0]
Bigram | B
In the feature templates of the table above, each template distinguishes its features with an identifier; for example, the Unigram templates use identifiers of the form "U00:". Features are written in the form "%x[row,col]", where row is the row offset relative to the current position and col is the column offset. In these templates only context information is selected as features, and the annotated training text provides only two columns, the text and the entity category, so when writing the templates the relative column number col is always 0 and only row varies. Templates U05-U09 combine unary features into binary or ternary features. Only one Bigram feature, written "B", appears in the template; it freely combines adjacent features. When "B" appears in a feature template applied to the training data, freely combined features of two adjacent rows, such as [-2,0]/[-1,0], are generated automatically.
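To make the %x[row,col] mechanism concrete, the following sketch expands a unigram macro by hand; it assumes one character per line with column 0 holding the text, as in the training file format above, and the padding token is only an approximation of CRF++'s boundary markers:

```python
# Expand a CRF++ "%x[row,col]" macro at position i of a sentence.
def expand_macro(tokens, i, row, col):
    j = i + row
    if 0 <= j < len(tokens):
        return tokens[j][col]
    return "_B%+d" % row  # CRF++ pads sentence boundaries with special tokens

sentence = [("江",), ("西",), ("变",), ("压",), ("器",), ("厂",)]

# U05:%x[-2,0]/%x[-1,0]/%x[0,0] combines three unigrams at position i = 2:
feature = "U05:" + "/".join(expand_macro(sentence, 2, r, 0) for r in (-2, -1, 0))
print(feature)  # -> U05:江/西/变
```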
4) CRF++-based semi-supervised named entity recognition algorithm
In order to construct a named entity dictionary for grid fault cases, the named entity classes to be recognized are determined first. Given the diversity of power grid fault text descriptions, the attributes of different devices cannot be unified, so artificial templates must be built to predefine the attributes that different devices need extracted. By observing the fault text cases, the devices are divided into six categories, and the attributes to be extracted for each category are shown in the following table:
table 4 device failure attribute template example
Figure GDA0002443826540000111
Figure GDA0002443826540000121
By formulating attribute templates, irrelevant attributes can be filtered out, making the structuring of case texts more targeted. This embodiment continuously iterates and optimizes the effect of named entity recognition with a CRF++-based semi-supervised algorithm, whose effectiveness is demonstrated by experiments. As shown in fig. 3, the specific steps are as follows (a code sketch of the loop is given after the steps):
A1. Manually select a small set of entities and construct an initial seed entity dictionary from them.
A2. Construct a training set for training CRF++. Considering the large workload of manual labeling, this embodiment performs automatic labeling based on complete matching: when an entity word from the entity dictionary appears in the fault case text, the word is labeled as a named entity. After the automatic labeling of named entities is completed, the result is converted into the CRF++ training file format to facilitate subsequent model training.
A3. Train on the constructed corpus with the CRF++ tool; after model training is completed, predict on the test corpus and find new named entities.
A4. Manually screen the recognized new entities, and add the screened entities to the entity dictionary for entity expansion.
A5. Repeat the above steps until no new valid entities can be found.
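The loop can be summarized in a few lines of Python. This is a sketch under assumptions: it drives the real CRF++ command-line tools crf_learn and crf_test (with the feature template file of Table 3), while dictionary_match, to_crfpp_format, extract_new_entities and manual_screen stand for project-specific helpers that are hypothetical, not part of the patent text:

```python
import subprocess

def semi_supervised_ner(seed_dictionary, train_texts, test_texts):
    dictionary = set(seed_dictionary)                        # A1: seed dictionary
    while True:
        labeled = dictionary_match(train_texts, dictionary)  # A2: auto-labeling
        to_crfpp_format(labeled, "train.data")               # (hypothetical helpers)
        to_crfpp_format(test_texts, "test.data")
        subprocess.run(["crf_learn", "template", "train.data", "model"],
                       check=True)                           # A3: train the model
        with open("result.data", "w") as out:                # A3: predict
            subprocess.run(["crf_test", "-m", "model", "test.data"],
                           check=True, stdout=out)
        candidates = extract_new_entities("result.data") - dictionary
        new_entities = manual_screen(candidates)             # A4: manual screening
        if not new_entities:                                 # A5: stop when no new
            return dictionary                                # valid entity appears
        dictionary |= new_entities
```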
Semi-supervised algorithm experiment based on CRF++
This embodiment confirms the effectiveness of the CRF++-based semi-supervised named entity recognition algorithm through experiments. The experiment takes the recognition of fault description entities as an example; the main steps and results are as follows:
A1. Manually construct a fault description entity dictionary containing 1406 fault description entities in total.
Table 5 Fault description entity dictionary example
(The table is rendered as an image in the original document.)
A2. For each fault case document, segment the content by periods to obtain a sentence set C, and automatically label all the sentences in C by complete matching. The power grid fault text cases used in the experiment comprise 1296 documents totaling 2,518,762 words. In the experiment, 70% of the case documents are used as the training set and 30% as the test set; the statistics of the training and test corpora are as follows:
TABLE 6 Statistics of the training corpus

Labeled fault entities | Total labeled words | Distinct fault entities
14153 | 34048 | 234
Table 7 Statistics of the test corpus

Labeled fault entities | Total labeled words | Distinct fault entities
5919 | 14382 | 177
A3. Train the model and recognize entities with CRF++. This embodiment counts all consecutive "BI" tag runs and extracts the corresponding named entities; during extraction, an entity must strictly begin with a "B" tag, and only then are the following "I" tags considered valid (a sketch of this extraction rule follows).
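A sketch of that extraction rule, assuming parallel lists of characters and predicted tags; the sample sentence reuses the Table 2 example:

```python
def extract_entities(chars, tags):
    """chars: list of characters; tags: parallel list of B/I/O tags."""
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":                 # strictly start a new entity at "B"
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag == "I" and current:   # "I" only continues an open entity
            current.append(ch)
        else:                          # "O", or a dangling "I", closes it
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

print(extract_entities(list("由江西变压器厂生产"),
                       ["O", "B", "I", "I", "I", "I", "I", "O", "O"]))
# -> ['江西变压器厂']
```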
The evaluation indexes used in the experiment are precision and recall. Precision P concerns the prediction results and indicates how many of the samples predicted to be positive are true positive samples. A sample predicted positive is either a true positive (TP) or a false positive (FP), giving:
$$P = \frac{TP}{TP + FP}$$
the recall rate R is for our original sample and indicates how many of the positive examples in the sample are predicted to be correct. There are also two possibilities, one to predict the original positive class as a positive class (TP) and the other to predict the original positive class as a negative class (FN).
$$R = \frac{TP}{TP + FN}$$
The results of the first round of recognition are as follows:
TABLE 8 Initial named entity recognition results
(The table is rendered as an image in the original document.)
In this embodiment, the recognized entities are divided into registered words and unknown words. Registered words have already appeared in the training set and are included in the existing dictionary. Unknown words do not appear in the training set but only in the test corpus; whether an unknown word is a new entity is judged by whether it is contained in the existing dictionary. Result statistics are computed separately for registered words and unknown words, and new entities are judged manually for correctness.
TABLE 9 Overall statistics of unknown words

Total recognized | 52
Correctly recognized | 49
Recognition accuracy | 94.23%
Distinct entities recognized | 25
Distinct entities recognized correctly | 24
Distinct entity accuracy | 96%
TABLE 10 Statistics of unknown words contained in the dictionary

Total recognized | 32
Correctly recognized | 32
Recognition accuracy | 100%
Distinct entities recognized | 15
Distinct entities recognized correctly | 15
Distinct entity accuracy | 100%
TABLE 11 statistics of unknown words not contained in the dictionary
(The table is rendered as an image in the original document.)
Table 12 Registered word recognition results

Total recognized | Correctly recognized | Accuracy
5853 | 5799 | 99.08%
Table 13 Examples of unknown words not contained in the dictionary

Relay water ingress
Fire short circuit
Unqualified grounding
Joint gas leak
A4. Manually screen the recognized new entities. Unknown words that are not contained in the dictionary are identified as new named entities. Observation shows that the new named entities fall into two categories. One is combinations of existing entities: for example, "fire" and "short circuit" are both contained in the fault entity dictionary, while "fire short circuit" is not. The other is completely new fault entities; in the first round of recognition there are three entities in this category, namely "unqualified grounding", "bolt loosening" and "breakage", of which "bolt loosening" appears in the case document as a test method rather than a fault and is therefore manually judged as a recognition error. After the manual screening is finished, the newly recognized entities are added to the existing entity dictionary for expansion and improvement.
The first-round results show that the recognition effect is good for both registered and unknown words. For registered words the accuracy is high because they appear in the training set; the high accuracy on unknown words shows that CRF++ has a strong ability to learn the context information of fault entities and their internal word patterns.
A5. Repeat the above steps. After 3 iterations, CRF++ can no longer find new entities and the named entity recognition ends. Meanwhile, the fuller learning also increases the recall on unknown words from 56.14% to 65.75%. In total, 22 valid new words are identified.
The above experiments show that the CRF++-based semi-supervised named entity recognition algorithm provided in this embodiment is effective. Through continuous iterative optimization, the algorithm successfully learns the context information and word patterns of the entities. Through named entity recognition, attributes and state quantities can be extracted effectively, achieving the purpose of text structuring.
Numeric state quantity description extraction process
The attributes described by numeric state quantities include time, number, temperature and various electrical indicators. For this type of state quantity, this embodiment uses a rule-based extraction method: first extract the numbers in each sentence by regular matching, then find the units of the numeric state quantities according to the grammar modification rules, merge the numbers and the mathematical units into complete state quantities, and match the attributes with their corresponding numeric state quantities to obtain the final (attribute, state quantity) duplets. The flow chart is shown in fig. 4.
Specifically, the method comprises the following steps:
A1. Extract numbers by regular matching. A regular expression is a logical formula for operating on strings: predefined specific characters and their combinations form a pattern string that expresses a filtering logic over strings, i.e., a text pattern describing one or more strings to be matched when searching text. This embodiment completes the regular matching by constructing a numeric regular expression.
A2. Manually construct a mathematical-unit mapping dictionary. Different state quantities have their own mathematical units as measuring standards, such as kilovolts (kv) for voltage and degrees centigrade for temperature, while state quantities similar to the "number" attribute have no mathematical unit but are plain numbers. It is therefore necessary to construct a manual mathematical-unit mapping dictionary, in which the units include both word forms and mathematical symbols. The unit mapping dictionary includes only the state quantities to be extracted for this subject. An example of the unit mapping dictionary is as follows:
table 14 Unit mapping dictionary example
Quantity of state Unit of
Device voltage class Kilovolt/volt/kv/v
Time of day Year/month/day
Temperature of Degree/degree centigrade// DEG C
Line number Is free of
A3. Mathematical unit extraction. To extract number units accurately, this embodiment adopts a complete-matching rule. As shown in the table above, number units have two forms of expression, text-based and symbol-based, and the matching methods for the two differ. Units in character form are extracted only when they are completely consistent with the units in the dictionary; for mathematical symbols, the case difference is ignored, which avoids omissions caused by upper/lower-case variation of letters.
A4. Match numbers and mathematical units. After the mathematical units are extracted, rules are formulated to match numbers with units. According to the characteristics of power grid fault texts, this embodiment matches directly by the proximity principle: a mathematical unit must immediately follow a number for the match to succeed, and only successfully matched numeric state quantities are considered valid candidates, which effectively avoids interference from irrelevant numbers. For time matching, consecutive "year", "month" and "day" parts must also be combined to obtain the complete time state quantity. (A code sketch of steps A1-A4 follows Table 15.)
Table 15 Numeric state quantity extraction example
(The table is rendered as an image in the original document.)
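A minimal sketch of steps A1-A4, with an illustrative fragment of the unit dictionary from Table 14; the regular expression and the proximity rule are simplified relative to a production implementation:

```python
import re

# Illustrative fragment of the unit mapping dictionary (Table 14).
UNIT_DICT = {
    "千伏": "device voltage class", "kv": "device voltage class",
    "v": "device voltage class",
    "℃": "temperature", "度": "temperature",
    "年": "time", "月": "time", "日": "time",
}

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")   # A1: numbers via regular matching

def extract_numeric_state_quantities(sentence):
    pairs = []
    for m in NUMBER_RE.finditer(sentence):
        rest = sentence[m.end():]
        # A3/A4: try longer units first; a unit must immediately follow
        # the number (proximity rule); case is ignored for symbol units.
        for unit in sorted(UNIT_DICT, key=len, reverse=True):
            if rest.lower().startswith(unit.lower()):
                pairs.append((UNIT_DICT[unit], m.group() + rest[:len(unit)]))
                break
    return pairs

print(extract_numeric_state_quantities("220kV线路于2016年发生故障"))
# -> [('device voltage class', '220kV'), ('time', '2016年')]
```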
A5. Match state quantities with description attributes. After numbers and mathematical units have been matched as above, the numeric state quantities are obtained. Next, rules are formulated to extract the attributes described by the state quantities. To obtain the modification relationship between a state quantity and an attribute, the text sentence must be syntactically parsed and the state quantity matched with the modified attribute based on the parsing result. The syntax analysis tool used in this embodiment is the Stanford Parser developed at Stanford University.
Stanford Parser is a syntax parsing tool released by the Stanford Natural Language Processing Group. It can parse the syntactic structure of a sentence and mark constituent labels for its different components, in particular part-of-speech labels for each word-segmentation unit. Input sentences must be segmented in advance, with the segmentation units separated by spaces. Stanford Parser can display the whole structure of a sentence as a tree whose leaf nodes are the segmentation units. It also provides interface functions for displaying the dependency relations between segmentation units, i.e., between any two words of the sentence. With Stanford Parser, the modification relation of each word in a sentence can thus be displayed intuitively, and rules can be formulated from these modification relations to extract the attributes modified by state quantities.
TABLE 16 Stanford Parser example
(The table is rendered as an image in the original document.)
The table shows an example of syntax analysis with Stanford Parser. The left side of the table is the parser input, a text sentence that has already been segmented; during segmentation, the extracted numeric state quantity must be kept as one unit to prevent the parsing result from becoming fragmented and disordered. The right side is the parsing result, represented as triples, each containing a modifier, a modified term and the dependency relation between the terms; constituents and dependency relations are represented by a set of predefined symbols. Common constituents include nouns (NN), verbs (VV), cardinal numbers (CD) and noun phrases (NP); common dependencies include nominal subject (nsubj), direct object (dobj), numeric modifier (nummod) and temporal modifier (tmod). Here only the modifiers and their dependencies matter, in order to extract the attributes modified by the state quantities. In the example, the attribute modified by the state quantity "220kV" is the "rich iron line"; named entity recognition shows that the rich iron line is a line entity, and the unit mapping dictionary shows that the device voltage class matches "kV", so the final duplet (rich iron line, 220kV) is extracted.
Non-numeric state quantity description extraction process
As shown in fig. 5, non-numeric state quantities mainly refer to textual state quantities, which take two forms of expression in power grid fault descriptions, one based on phrases and one based on sentences. For phrase-form state quantities, this embodiment uses grammar modification rules to match the state quantity with its modified attribute; for sentence-form state quantities, this embodiment uses a machine learning method for classification.
Specifically, the non-numeric state quantity description extraction comprises the following steps:
and A1, extracting state quantities based on the phrase form. A phrase-based state quantity is essentially a named entity, such as a device name, a manufacturer name, and so forth. Named entities in each case text statement can be identified by the conditional random field model described above.
A2. Extract the attributes modified by phrase state quantities based on grammar modification relations. With the Stanford Parser tool, the words modified by a named entity can be extracted. This embodiment formulates a grammar modification rule to constrain the matching conditions: a named entity must find the nearest modified noun along the syntax tree, and that noun forms a duplet with the named entity. The device-class attribute is an exception to this rule, since the device class is usually not indicated explicitly in the text but given directly by the entity. (A schematic sketch of the rule follows the example below.)
Table 17 Phrase state quantity description extraction example
(The table is rendered as an image in the original document.)
As can be seen from the example, the Zibo Tai photovoltaic power equipment factory is recognized as a factory entity; the word it directly modifies is "production", and the word modified by "production" is "insulator", so the duplet (insulator, Zibo Tai photovoltaic power equipment factory) is extracted. For the device type, the "colossal line" is recognized as a device entity, and its type is not indicated in the sentence, so the duplet (device type, colossal line) is extracted directly.
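A schematic sketch of the nearest-noun rule applied to the example above; the Token class and the flat head-pointer structure are simplifications standing in for the output of a real dependency parser such as Stanford Parser:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str
    pos: str                          # e.g. "NN" (noun), "VV" (verb)
    head: Optional["Token"] = None    # governing token in the parse tree

def nearest_modified_noun(entity: Token) -> Optional[Token]:
    """Walk from the named entity towards the root; return the first noun."""
    node = entity.head
    while node is not None:
        if node.pos == "NN":
            return node
        node = node.head
    return None

# "... insulator produced by the Zibo Tai photovoltaic power equipment factory"
insulator = Token("insulator", "NN")
produce = Token("produce", "VV", head=insulator)
factory = Token("Zibo Tai photovoltaic power equipment factory", "NN",
                head=produce)
noun = nearest_modified_noun(factory)
print((noun.text, factory.text))
# -> ('insulator', 'Zibo Tai photovoltaic power equipment factory')
```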
A3. Match sentence-form state quantities with the described attributes using a Bi-LSTM (bidirectional long short-term memory) based classification algorithm. During text structuring it is found that in fault case texts the descriptions of seven state quantities, namely fault description, fault analysis process, analysis conclusion, fault handling method, field condition, suggestion, and countermeasure, are all at sentence level, so the extraction of these state descriptions is converted into a 7-class sentence classification problem, completing the matching of sentence text state quantities with the description attributes (a model sketch follows).
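A minimal PyTorch sketch of such a Bi-LSTM sentence classifier; the layer sizes and vocabulary size are illustrative assumptions, not the patent's actual hyper-parameters:

```python
import torch
import torch.nn as nn

class BiLSTMSentenceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.bilstm(embedded)       # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        sentence_vec = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.classifier(sentence_vec)      # logits over the 7 classes

model = BiLSTMSentenceClassifier(vocab_size=50000)
logits = model(torch.randint(1, 50000, (8, 40)))  # 8 sentences, 40 tokens each
```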
Machine-learning-based automatic classification of Chinese sentences mainly comprises a training stage and a testing stage. In the training stage, the sentences are preprocessed by word segmentation, stop-word removal and the like; the text is then represented, i.e., converted into a form the computer can recognize; a suitable classifier is then selected to classify the sentences and output predictions, and if the results do not reach a given standard, the parameters are adjusted and training is repeated. In the testing stage, preprocessing and text representation follow the same flow as in training; the classifier obtained in training is then used for classification, and the classification results are compared with the given labels to evaluate the performance of the classification model and provide a reference for follow-up work. The specific steps are as follows:
and B1, segmenting the fault case text through a text segmentation tool. The word segmentation adopts the result word segmentation tool of python. The Chinese character input method based on the Chinese character input method is characterized in that the Chinese character input method is an open source library, is simple and easy to use, has high accuracy, and is widely used in the field of word segmentation. Because the power grid fault case documents are all organized by continuous natural language sequences, no obvious segmentation symbol exists among words, the purpose of text word segmentation is to divide the continuous word sequences into a plurality of words, and the method is one of the most common preprocessing methods in text mining. The following table gives an example of a text segmentation:
Table 18 text word segmentation example
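A minimal segmentation sketch with jieba; the example sentence and the user-dictionary path are illustrative assumptions:

import jieba

# loading the entity dictionary from step (1) keeps domain terms such as
# equipment names from being split apart (the file name is hypothetical)
jieba.load_userdict('grid_entity_dict.txt')

sentence = '变压器套管发生放电故障'      # a hypothetical fault-case sentence
print(jieba.lcut(sentence))              # e.g. ['变压器', '套管', '发生', '放电', '故障']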
B2, applying a word embedding method to the segmented words to obtain a distributed text representation based on the word2vec tool.
Specifically, a text representation is a numeric vector, obtained by representing a text string in some form, that a computer can process. Text representation is needed because a computer cannot operate on text strings directly, so the text must be digitized or vectorized. Traditional machine learning algorithms require this step, and so does deep learning, although in deep learning it can be built directly into the network; moreover, a good text representation can greatly improve the effectiveness of the algorithm.
Conventional classification models vectorize text with the Vector Space Model (VSM), whose basic idea is to represent a sentence as a vector in which each dimension corresponds to one feature of the sentence. Depending on how features are selected, there are various representation methods, such as term frequency-inverse document frequency (TF-IDF), Boolean representation, and Latent Dirichlet Allocation (LDA). Feature selection in these methods is based on the frequency of each word in the sentence without considering word order; for example, the two texts "a certain hot-spot temperature is greater than 110 °C" and "a certain hot-spot temperature is less than 110 °C" clearly differ in meaning, yet after word segmentation and VSM representation they are converted into the same sentence vector. At the same time, the VSM cannot represent semantic correlation between words: two synonymous terms that both denote the conservator, for example, are treated as two different, unrelated words, so much semantic information is lost. Moreover, the character string itself carries no semantic information, which poses challenges for NLP tasks.
In a Neural Probabilistic Language Model, words are represented as semantically oriented vectors: the vectors of two semantically similar words are themselves similar, as reflected in their angle or distance. Even the vectors obtained by linear subtraction over word pairs with analogous semantics remain similar.
From a vector perspective, a word in string form is in fact a high-dimensional, sparse vector: if the vocabulary size is N and a given string is the i-th entry of the lexicon, it is represented as an N-dimensional vector whose i-th dimension is 1 and all other dimensions are 0. The Chinese vocabulary is on the order of one hundred thousand words, and vectors of one hundred thousand dimensions are a computational dimension disaster, whereas word2vec produces dense vectors whose dimension can be freely controlled and is generally around 100.
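The contrast can be made concrete with a small sketch; the vocabulary size and word index below are illustrative assumptions:

import numpy as np

N = 100_000                  # vocabulary size, roughly per the text above
i = 42                       # lexicon index of some word
one_hot = np.zeros(N)
one_hot[i] = 1.0             # string-form word: a sparse 100,000-dimensional vector
# word2vec instead learns a dense vector of freely chosen size, typically ~100
dense = np.zeros(100)        # stand-in for a trained 100-dimensional word vector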
This embodiment uses the word2vec tool with a Distributed representation of word vectors. The Distributed representation was first proposed by Hinton in 1986. Its basic idea is to map each word, through training, to a K-dimensional real-valued vector (K is generally a hyperparameter of the model) and to judge semantic similarity between words by the distance between their vectors (cosine similarity, Euclidean distance, and so on). word2vec uses a three-layer neural network: input layer, hidden layer, output layer. It applies Huffman coding according to word frequency, so that the hidden-layer contents activated by words of similar frequency are basically consistent and words with higher frequency activate fewer hidden-layer nodes, which effectively reduces the computational complexity. The three-layer network is trained as a language model, but in doing so it obtains the representation of each word in a vector space, and this by-product is the real goal of word2vec. Compared with classical approaches such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), word2vec exploits the context of words, so its semantic information is richer. word2vec adopts a hierarchical Log-Bilinear language model and supports two architectures, CBOW (Continuous Bag-of-Words Model) and Skip-gram.
The specific steps of acquiring the text representation in this embodiment include:
C1, preprocessing the fault case text by sentence segmentation, word segmentation, and stop-word filtering, and storing the result sentence by sentence to obtain a sentence set.
In this embodiment, word segmentation and stop-word filtering are mature text-processing techniques and are not described again here. Sentence segmentation means splitting the text at sentence-final punctuation to obtain the preprocessed sentence set.
C2, computing the word vector representation of each word.
A word vector model is trained with the word2vec module of gensim (a Python natural language processing library) to obtain the word vector of every word in the corpus. The vectors are set to 100 dimensions, and each dimension of a word vector represents a semantic feature of the word learned by the continuous skip-gram model.
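A minimal training sketch with gensim; tokenized_sentences is assumed to hold the token lists from step C1, and note that gensim versions before 4.0 name the dimension parameter size rather than vector_size:

from gensim.models import Word2Vec

# tokenized_sentences: list of token lists produced in step C1 (assumed variable)
model = Word2Vec(tokenized_sentences, vector_size=100, sg=1,   # sg=1 -> skip-gram
                 window=5, min_count=1)
vector = model.wv['变压器']   # the 100-dimensional vector of an in-vocabulary word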
C3, obtaining the text representation of a sentence from its word vectors.
In this embodiment, a sentence is represented as an n × m matrix, where n is the number of words in the sentence and m is the word vector dimension.
After these steps, the distributed text representation of each sentence is obtained.
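A sketch of assembling the n × m sentence matrix from the trained vectors; padding to 50 words matches the network input described below, and the zero-padding choice is an assumption:

import numpy as np

def sentence_matrix(tokens, model, max_len=50):
    """Stack word vectors into a (max_len x 100) matrix, zero-padded/truncated."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    m = np.zeros((max_len, model.vector_size))
    if vecs:
        k = min(len(vecs), max_len)
        m[:k] = vecs[:k]
    return m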
B3, classifying the sentence representations with the Bi-LSTM-based algorithm.
The performance of the text classifier determines the classification effect. Text classifiers fall broadly into two categories: machine-learning-based classifiers and non-machine-learning classifiers. Non-machine-learning classifiers can be further divided into those based on probability theory and those based on TF-IDF weights; common examples are the naive Bayes algorithm and the K-nearest-neighbor algorithm. Machine-learning-based classifiers include Support Vector Machines (SVM), neural networks, decision trees, and so on. These classical classifiers have low model complexity and fast training; however, because their models are shallow and cannot extract features automatically, their classification effect depends heavily on manual feature selection.
Common deep learning models in the field of text classification are an RNN model, an LSTM model and a Bi-LSTM model.
The RNN (recurrent neural network) model adds a recurrent structure to a conventional neural network, introducing the concept of time: at each time step t the recurrence is executed, and the state of the hidden-layer node at the next time step is updated from the current input and the hidden state at the previous time step. The hidden state of the RNN at each time step therefore uses the information of the preceding node, and the hidden state at the last node stores the history of all previous inputs. The unrolled RNN network is shown in Fig. 6.
The input of the RNN model is a sequence, and the current node is predicted using information from previous inputs; that is, the output of the current node depends on context. As the input sequence grows longer, the influence of history nodes far from the current node decreases, because the hidden-layer activations are repeatedly updated and overwritten; the RNN's ability to learn history information thus declines, i.e., its perception of distant earlier inputs weakens. This is the long-term dependency problem of the RNN model. In theory the RNN model could solve it, but long input sequences also cause gradient problems in back-propagation, making the RNN parameters hard to train.
Long Short-Term Memory networks, commonly called LSTMs, are a special type of RNN that can learn long-term dependencies. The LSTM was proposed by Hochreiter & Schmidhuber (1997) and later refined and popularized by Alex Graves. LSTMs have achieved considerable success and widespread use on many problems. The LSTM avoids the long-term dependency problem by deliberate design: remembering long-term information is in practice its default behavior rather than an ability obtained at great cost.
All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeated module has a very simple structure, e.g. a single tanh layer, as shown in Fig. 7.
The LSTM has the same chain structure, but the repeated module is different: instead of a single neural network layer, there are four layers that interact in a very particular way.
The LSTM enables the RNN model to remember long-term information by designing a special cell structure with three "gate" structures through which information is controlled; information passing through the cell can thereby be selectively added or removed. These three gates act on the cell and form the hidden layer of the LSTM model, also known as a block. The three gates are the forget gate (Forget Gate), the input gate (Input Gate), and the output gate (Output Gate), whose functions are as follows:
Forget gate: controls how much of the history stored by the hidden-layer node at the previous time step is kept. From the previous hidden state and the current input, the forget gate computes a value between 0 and 1 that acts on the previous cell state to decide which information is retained and which is discarded, where 1 means retain and 0 means discard. Through the forget gate, the history carried by the cell is processed selectively.
Input gate: controls the input to the current hidden-layer cell state. It judges whether the input information should be written into the cell state at the current moment, i.e., which information needs updating, while preserving the stored information. The input gate outputs a sigmoid value between 0 and 1 that acts on the input information to decide whether the corresponding value of the cell state is updated: 1 means the information passes and the value is updated, 0 means it does not pass and no update occurs. The input gate can thus filter out unneeded information.
Output gate: controls the output of the current hidden-layer node, i.e., decides whether to output to the next hidden layer or the output layer. By controlling the output gate we determine which information is output; its value is likewise between 0 and 1, where 1 means output and 0 means no output. The control signal of the output gate acts on the current cell state, which, after processing, yields the final output value.
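In the standard formulation (that of Hochreiter & Schmidhuber, as cited above), the three gates and the cell update can be written as follows, where σ is the sigmoid function, ⊙ the element-wise product, x_t the current input, and h_{t-1} the previous hidden state:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)        (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)        (input gate)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)     (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t            (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)        (output gate)
h_t = o_t ⊙ tanh(c_t)                       (hidden-layer output)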
As shown in Fig. 9, the LSTM solves the gradient vanishing and gradient explosion problems of the standard RNN and lets the current cell use information from earlier cells. Its shortcoming is that information from later cells cannot be used, which is why the BiLSTM neural network was devised (its model diagram is shown in the figure). The BiLSTM is a further improvement on the plain LSTM and performs excellently on sequence labeling tasks. Its basic idea is to use a forward and a backward LSTM to capture the hidden information of the past and of the future respectively; these two parts together form the final output, so the final semantic encoding depends on both the preceding and the following context.
The steps of the Bi-LSTM-based classification algorithm used in this embodiment, together with the experimental results, are described in detail below.
D1, constructing a Bi-LSTM sentence classifier.
The superiority of the Bi-LSTM for text processing was described above, so this embodiment classifies sentences with a Bi-LSTM classifier. A five-layer bidirectional LSTM neural network is first constructed experimentally, in which the forward LSTM (Forward LSTM) and backward LSTM (Backward LSTM) branches have exactly the same layer structure; the network structure is shown in Fig. 10.
The design of each layer is as follows:
E1, the input shape of the first layer of the network is set to (600, 50, 100), where 600 is the number of input samples, 50 is the maximum number of words in each sample (i.e., each sentence), and 100 is the word vector dimension of each word. The output shape of the first layer is (600, 50, 64), where 600 and 50 keep their meanings and 64 is the number of neurons in this layer.
E2, the input of the second LSTM layer is the output of the first layer; its output shape shows that this layer also uses 64 neurons.
E3, the output of the third layer is (600, 32); since a merge layer follows, the whole sequence need not be returned, only the last output of the output sequence, so the output is a 2D tensor.
E4, the fourth layer is a merge layer that fuses the outputs of the Forward LSTM and the Backward LSTM; the fusion mode is "concat", i.e., the two tensors are concatenated.
E5, the fifth layer is the output layer, fully connected to the merge layer; it takes the merge-layer vector P as input, classifies P with softmax, and outputs the prediction over the 7 classes.
The Bi-LSTM model is built from these five layers; a minimal sketch follows.
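A minimal Keras sketch of this five-layer architecture; the optimizer and any hyperparameters not stated in the embodiment are assumptions:

from tensorflow.keras import layers, models

inp = layers.Input(shape=(50, 100))                  # 50 words x 100-dim vectors
# forward branch: three LSTM layers (64, 64, 32 neurons)
f = layers.LSTM(64, return_sequences=True)(inp)
f = layers.LSTM(64, return_sequences=True)(f)
f = layers.LSTM(32)(f)                               # last output only -> 2D tensor
# backward branch: identical structure, reading the sequence in reverse
b = layers.LSTM(64, return_sequences=True, go_backwards=True)(inp)
b = layers.LSTM(64, return_sequences=True)(b)
b = layers.LSTM(32)(b)
p = layers.Concatenate()([f, b])                     # merge layer, mode "concat"
out = layers.Dense(7, activation='softmax')(p)       # 7-class prediction
model = models.Model(inp, out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])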
D2, acquiring the annotation data in a semi-supervised manner.
First, 10 sentences are labeled in each of the seven categories (fault description, fault analysis process, analysis conclusion, fault handling method, field conditions, suggestions and countermeasures, and fault cause), 70 in total. Then sentence vectors are trained with doc2vec over all segmented sentences (the principle of doc2vec is similar to that of word2vec), and the vector representation of every sentence is obtained and stored. For each of the 70 labeled sentences, the 10 sentences with the highest sentence-vector similarity are retrieved; after de-duplication the sentences of each category are expanded to 60, and manual screening of these yields 400 expanded labeled sentences; a sketch of the doc2vec expansion step is given after the tables below. The per-category counts are:
Table 19 number of labeled sentences per category
The labeling results are tabulated below:
Table 20 example sentence classification results
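A sketch of the doc2vec expansion step with gensim; the variable names and hyperparameters are assumptions, and gensim versions before 4.0 expose the document vectors as model.docvecs rather than model.dv:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_sentences: all segmented sentences of the corpus (assumed variable)
docs = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized_sentences)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)

# for a labeled seed sentence with tag seed_id, retrieve the 10 most similar
# sentences as labeling candidates for that seed's category
candidates = d2v.dv.most_similar(seed_id, topn=10)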
D3, selecting an evaluation index.
Because this embodiment addresses a multi-class problem, the evaluation index adopted is the F1-score, a comprehensive index that harmonizes precision and recall and therefore reflects the quality of the model objectively and fairly. The formula for this index is:
F1 = 2 × P × R / (P + R), where P denotes the precision and R the recall.
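For a 7-class task the score is usually averaged over the classes; a one-line sketch with scikit-learn, where the macro averaging mode is an assumption since the embodiment does not state it:

from sklearn.metrics import f1_score

# y_true, y_pred: gold and predicted labels of the 7 sentence classes (assumed)
macro_f1 = f1_score(y_true, y_pred, average='macro')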
D4, evaluating the influence of the number of iterations and the Dropout rate on sentence classification.
The number of iterations determines not only the performance of the model but also the training time, and the Dropout rate may affect the model's generalization ability. Therefore the optimal number of iterations and whether a Dropout layer should be added to the model are considered first. Two models, BLSTM_drop and BLSTM_nodrop, are trained simultaneously; their only difference is that BLSTM_nodrop has no Dropout layer after its three LSTM layers.
Fig. 11 depicts how the Loss and F1-score values of BLSTM_drop and BLSTM_nodrop change on the training and validation sets as the number of iterations increases.
As Fig. 11 shows, once the number of iterations exceeds 15 the Loss value stabilizes, meaning that further increasing the iterations does little for model optimization; the same conclusion follows from Fig. 12. The optimal number of iterations for the model here is therefore 15. Fig. 11 also shows that the Loss values of the BLSTM_drop model on the validation set are all greater than those of the BLSTM_nodrop model, and in Fig. 12 the F1_score values of BLSTM_drop on the validation set are all less than those of BLSTM_nodrop, so the model here performs best without a Dropout layer. This is because Dropout works well on models with larger data scales, whereas the data set here is relatively small, and adding the layer causes information loss.
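For reference, the BLSTM_drop variant differs from the model sketch above only by inserting a Dropout layer after each LSTM layer; the 0.5 rate below is an assumption, as the embodiment does not state it:

# in the forward branch of the sketch above (and likewise in the backward branch)
f = layers.LSTM(64, return_sequences=True)(inp)
f = layers.Dropout(0.5)(f)    # present in BLSTM_drop, absent in BLSTM_nodrop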
D5, comparing BiLSTM and LSTM performance.
To compare the performance of the BiLSTM with the LSTM, the best Bi-LSTM obtained above is compared with an LSTM model whose backward propagation branch is removed. The experimental results are shown in Fig. 13; to ensure the validity of the comparison, the results are obtained under the optimal parameters of each model.
The F value of the BiLSTM is found to be higher than that of the LSTM. The BiLSTM outperforms the LSTM because the LSTM propagates sequence information only from front to back, with no back-to-front propagation; this one-way mechanism can represent only the preceding context of the current word in the sentence and never the following context. Adding a reverse LSTM on top of the forward LSTM lets the forward LSTM capture the preceding feature information and the reverse LSTM the following feature information, and fusing the two finally yields global contextual features. This is why the BiLSTM classifies typical fault case sentences better.
The above experiments verify that the classification algorithm for sentence-form state quantities proposed in this embodiment is effective, and the corresponding (attribute, state quantity) tuple match can be obtained directly from the classification result.
The above is the specific implementation of structuring power grid fault case text in this embodiment: named entity recognition is performed with CRF++, and the numeric state quantities and their described attributes are then extracted based on rules; for non-numeric state quantities and described attributes in phrase form, this embodiment adopts modification matching based on grammar rules, and for state quantities in sentence form it proposes a classification algorithm based on the bidirectional LSTM to extract the modified attributes. Finally, experiments prove that this embodiment is effective for text structuring of power grid fault cases.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention, and it should be understood that various modifications and variations can be made by those skilled in the art, without inventive effort, on the basis of the technical solution of the present invention.

Claims (10)

1. A text structuring method for a power grid fault case, characterized by comprising the following steps:
(1) carrying out named entity recognition on the unstructured text, and constructing an entity dictionary oriented to the power grid domain to assist entity recognition and text word segmentation;
(2) extracting the attributes and the state quantity values describing them, wherein the state quantities are divided by type into numeric state quantities and non-numeric state quantities, and the attributes modified by the numeric state quantities are extracted and matched with a rule-based method;
(3) refining the non-numeric state quantities into phrase-form state quantities and sentence-form state quantities, and extracting the attributes modified by each;
(4) finally generating, from the identified attributes and their corresponding state quantities, a set of binary tuples formed by the attributes and the corresponding state quantities, thereby completing the text structuring.
2. The text structuring method for a power grid fault case according to claim 1, characterized in that: in step (1), in the training corpus labeling stage, a dictionary matching method is adopted for automatic labeling, and the entity dictionary is continuously perfected by a CRF++-based semi-supervised named entity recognition method.
3. The text structuring method for a power grid fault case according to claim 2, characterized by specifically comprising the following steps:
(1-1) constructing an initial seed entity dictionary;
(1-2) constructing a training set for training CRF++: automatic labeling is performed with a complete-matching-based method, in which a word is labeled as a named entity whenever an entity word from the entity dictionary appears in the fault case text; after the automatic labeling of named entities is completed, the result is converted into the CRF++ training file format to facilitate subsequent model training;
(1-3) training on the constructed corpus with the CRF++ tool, predicting on the test corpus after model training is finished, and finding new named entities;
(1-4) screening the newly identified entities, and adding those that pass screening into the entity dictionary for entity expansion;
(1-5) repeating steps (1-1) - (1-4) until no new valid entities can be found.
4. The text structuring method for a power grid fault case according to claim 1, characterized in that: in step (1), the described attributes comprise faulty equipment, fault phenomenon, fault cause, fault time and/or voltage level.
5. The text structuring method for a power grid fault case according to claim 1, characterized in that: in step (2), the attributes described by numeric state quantities in the text are extracted with a rule-based method.
6. The text structuring method for a power grid fault case according to claim 5, characterized in that: mathematical units are identified through a manually constructed mathematical-unit mapping dictionary, and units in character form that exactly match entries in the dictionary are extracted; for mathematical symbols, case differences are ignored during extraction; after a mathematical unit is extracted, the numbers are identified with regular expressions, and rules are adopted to match the numbers with the mathematical units.
7. The text structuring method for a power grid fault case according to claim 1, characterized in that: in step (3), the non-numeric state quantities refer to textual state quantities.
8. The text structuring method for a power grid fault case according to claim 1, characterized in that: in step (3), the phrase-form state quantities are essentially named entities; the named entities in each case text sentence are identified by a conditional random field model, and the words modified by the named entities are extracted with the Stanford Parser tool.
9. The text structuring method for a power grid fault case according to claim 8, characterized in that: the matching conditions are constrained by the set grammar modification rules: the named entity must find the nearest modified noun along the syntax tree, and that noun and the named entity form a binary tuple.
10. The text structuring method for a power grid fault case according to claim 1, characterized in that: in step (3), during structuring, the descriptions of the fault description, fault analysis process, analysis conclusion, fault handling method, field condition, and suggestion-and-countermeasure state quantities are sentences in the fault case text; the extraction problem for these descriptions is converted into a classification problem over the sentences, and a bidirectional LSTM classification model is constructed to complete the matching of sentence-form state quantities with the described attributes.
CN201711325919.8A 2017-12-13 2017-12-13 Text structuring method for power grid fault case Active CN107992597B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711325919.8A CN107992597B (en) 2017-12-13 2017-12-13 Text structuring method for power grid fault case


Publications (2)

Publication Number Publication Date
CN107992597A CN107992597A (en) 2018-05-04
CN107992597B true CN107992597B (en) 2020-08-18

Family

ID=62037706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711325919.8A Active CN107992597B (en) 2017-12-13 2017-12-13 Text structuring method for power grid fault case

Country Status (1)

Country Link
CN (1) CN107992597B (en)

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108877951B (en) * 2018-05-24 2022-03-25 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) Radiotherapy structure naming standardization method, device, equipment and medium
CN108829775A (en) * 2018-05-30 2018-11-16 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling logging device title extracting method based on condition random field
CN109508382B (en) * 2018-10-19 2020-08-21 北京明略软件系统有限公司 Label labeling method and device and computer readable storage medium
CN109446523B (en) * 2018-10-23 2023-04-25 重庆誉存大数据科技有限公司 Entity attribute extraction model based on BiLSTM and conditional random field
CN111104798B (en) * 2018-10-27 2023-04-21 北京智慧正安科技有限公司 Resolution method, system and computer readable storage medium for sentencing episodes in legal documents
CN111199156B (en) * 2018-10-31 2023-04-07 北京国双科技有限公司 Named entity recognition method, device, storage medium and processor
CN111199157B (en) * 2018-11-19 2023-04-18 阿里巴巴集团控股有限公司 Text data processing method and device
CN109542969B (en) * 2018-11-23 2023-02-07 国网电力科学研究院武汉南瑞有限责任公司 Text transformer test data structuring system and method
CN109657207B (en) * 2018-11-29 2023-11-03 爱保科技有限公司 Formatting processing method and processing device for clauses
CN109614381A (en) * 2018-12-07 2019-04-12 北京科东电力控制系统有限责任公司 Power scheduling log classification method, device and equipment
CN109684447A (en) * 2018-12-13 2019-04-26 贵州电网有限责任公司 A kind of dispatching of power netwoks running log fault information analysis method based on text mining
CN109800416A (en) * 2018-12-14 2019-05-24 天津大学 A kind of power equipment title recognition methods
CN109685215B (en) * 2018-12-17 2023-01-20 中科国力(镇江)智能技术有限公司 Quick intelligent aid decision support system and method
CN109815500A (en) * 2019-01-25 2019-05-28 杭州绿湾网络科技有限公司 Management method, device, computer equipment and the storage medium of unstructured official document
CN109885491B (en) * 2019-02-12 2022-07-05 科华恒盛股份有限公司 Method for detecting existence of data overflow expression and terminal equipment
CN111597801B (en) * 2019-02-20 2023-09-15 上海颐为网络科技有限公司 Text automatic structuring method and system based on natural language processing
CN109977228B (en) * 2019-03-21 2021-01-12 浙江大学 Information identification method for power grid equipment defect text
CN110287785A (en) * 2019-05-20 2019-09-27 深圳壹账通智能科技有限公司 Text structure information extracting method, server and storage medium
CN110378585B (en) * 2019-07-08 2022-09-02 国电南瑞科技股份有限公司 Power grid fault handling calculation task arrangement calling method, system and storage medium
CN110501585B (en) * 2019-07-12 2021-05-04 武汉大学 Transformer fault diagnosis method based on Bi-LSTM and analysis of dissolved gas in oil
CN110414012B (en) * 2019-07-29 2022-12-09 腾讯科技(深圳)有限公司 Artificial intelligence-based encoder construction method and related equipment
CN110619046A (en) * 2019-08-30 2019-12-27 交控科技股份有限公司 Fault identification method based on fault tracking table
CN111090999A (en) * 2019-10-21 2020-05-01 南瑞集团有限公司 Information extraction method and system for power grid dispatching plan
JP7160503B2 (en) * 2019-11-06 2022-10-25 三菱電機ビルソリューションズ株式会社 Building information processing equipment
CN111090973A (en) * 2019-11-26 2020-05-01 北京明略软件系统有限公司 Report generation method and device and electronic equipment
CN111026916B (en) * 2019-12-10 2023-07-04 北京百度网讯科技有限公司 Text description conversion method and device, electronic equipment and storage medium
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN111177323B (en) * 2019-12-31 2022-04-01 国网安徽省电力有限公司安庆供电公司 Power failure plan unstructured data extraction and identification method based on artificial intelligence
CN111428981A (en) * 2020-03-18 2020-07-17 国电南瑞科技股份有限公司 Deep learning-based power grid fault plan information extraction method and system
CN111767389A (en) * 2020-05-22 2020-10-13 湖南正宇软件技术开发有限公司 Method and device for recommending case handling unit according to proposed content
CN111783968B (en) * 2020-06-30 2024-05-31 山东信通电子股份有限公司 Power transmission line monitoring method and system based on cloud edge cooperation
CN111859798A (en) * 2020-07-14 2020-10-30 辽宁石油化工大学 Flow industrial fault diagnosis method based on bidirectional long-time and short-time neural network
CN111857097B (en) * 2020-07-27 2023-10-31 中国南方电网有限责任公司超高压输电公司昆明局 Industrial control system abnormality diagnosis information identification method based on word frequency and inverse document frequency
CN112183994B (en) * 2020-09-23 2023-05-12 南方电网数字电网研究院有限公司 Evaluation method and device for equipment state, computer equipment and storage medium
CN112380328B (en) * 2020-11-11 2024-02-06 广州知图科技有限公司 Interaction method and system for safety emergency response robot
CN112507117B (en) * 2020-12-16 2024-02-13 中国南方电网有限责任公司 Deep learning-based automatic overhaul opinion classification method and system
CN112463642B (en) * 2020-12-16 2021-08-03 北京京航计算通讯研究所 Software design defect checking method and system based on fault mode
CN112784584B (en) * 2020-12-23 2024-01-26 北京泰豪智能工程有限公司 Text data element semantic recognition method and device
CN112732934B (en) * 2021-01-11 2022-05-27 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method
CN112949455B (en) * 2021-02-26 2024-04-05 武汉天喻信息产业股份有限公司 Value-added tax invoice recognition system and method
CN113190541A (en) * 2021-05-12 2021-07-30 《中国学术期刊(光盘版)》电子杂志社有限公司 Entity identification method based on digital human
CN114168817A (en) * 2021-11-05 2022-03-11 合肥湛达智能科技有限公司 Semi-supervised learning target identification method
CN115567690A (en) * 2022-09-22 2023-01-03 国网山东省电力公司莒县供电公司 Intelligent monitoring system capable of automatically identifying dangerous points of field operation
CN115546814A (en) * 2022-10-08 2022-12-30 招商局通商融资租赁有限公司 Key contract field extraction method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636466A (en) * 2015-02-11 2015-05-20 中国科学院计算技术研究所 Entity attribute extraction method and system oriented to open web page
CN106250524A (en) * 2016-08-04 2016-12-21 浪潮软件集团有限公司 Organization name extraction method and device based on semantic information

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090019032A1 (en) * 2007-07-13 2009-01-15 Siemens Aktiengesellschaft Method and a system for semantic relation extraction
US10409908B2 (en) * 2014-12-19 2019-09-10 Google Llc Generating parse trees of text segments using neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636466A (en) * 2015-02-11 2015-05-20 中国科学院计算技术研究所 Entity attribute extraction method and system oriented to open web page
CN106250524A (en) * 2016-08-04 2016-12-21 浪潮软件集团有限公司 Organization name extraction method and device based on semantic information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research and implementation of citation annotation technology based on CRFs model; Zhou Junxian; China Masters' Theses Full-text Database; 2015-06-15 (No. 6); Section 3.1 on p. 17 and Section 4.1 on pp. 27-28 *
Named entity recognition in the music domain based on conditional random fields; Hao Lechuan; China Masters' Theses Full-text Database; 2014-04-15 (No. 4); Sections 3.3-4.2 on pp. 15-40 and Figs. 4-5 to 4-9 *
Research on key technologies and algorithms for cognition-based unstructured information extraction; Mu Yifu; China Doctoral Dissertations Full-text Database; 2013-10-15 (No. 10); last paragraph of p. 24 and first paragraph of p. 57 *
Localized bidirectional long short-term memory for text classification; Wan Shengxian et al.; Journal of Chinese Information Processing; 2017-05-31; Vol. 31 (No. 3); Part 1 on p. 63 *

Also Published As

Publication number Publication date
CN107992597A (en) 2018-05-04

Similar Documents

Publication Publication Date Title
CN107992597B (en) Text structuring method for power grid fault case
CN112001185B (en) Emotion classification method combining Chinese syntax and graph convolution neural network
CN110334354B (en) Chinese relation extraction method
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
Terechshenko et al. A comparison of methods in political science text classification: Transfer learning language models for politics
CN109726745B (en) Target-based emotion classification method integrating description knowledge
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN111753058B (en) Text viewpoint mining method and system
CN110807084A (en) Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN110263325A (en) Chinese automatic word-cut
CN113157859B (en) Event detection method based on upper concept information
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
Xing et al. A convolutional neural network for aspect-level sentiment classification
Zhang et al. n-BiLSTM: BiLSTM with n-gram Features for Text Classification
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN113204967B (en) Resume named entity identification method and system
Sartakhti et al. Persian language model based on BiLSTM model on COVID-19 corpus
CN114330338A (en) Program language identification system and method fusing associated information
CN110134950A (en) A kind of text auto-collation that words combines
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN109858550B (en) Machine identification method for potential process failure mode
CN114265936A (en) Method for realizing text mining of science and technology project

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant