CN113901218A - Inspection business basic rule extraction method and device - Google Patents
- Publication number
- CN113901218A (application number CN202111179406.7A)
- Authority
- CN
- China
- Prior art keywords: text, word, corpus, model, rule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/00 Information retrieval; database structures therefor; file system structures therefor
- G06F16/30 Information retrieval of unstructured textual data
- G06F16/35 Clustering; classification
- G06F40/00 Handling natural language data
- G06F40/20 Natural language analysis
- G06F40/205 Parsing
- G06F40/216 Parsing using statistical methods
- G06F40/237 Lexical tools
- G06F40/242 Dictionaries
- G06F40/279 Recognition of textual entities
- G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
- G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06 Energy or water supply
Abstract
The invention discloses a method and a device for extracting basic inspection business rules. The invention constructs a professional base dictionary for the electric power inspection field together with a word vector generation model, fully considers the characteristics of business rules in the inspection field and the differences in importance between words, and converts the relationship between entities into an entity type, so that relations are extracted directly from the text as entities rather than being limited to known inspection business relations. The invention effectively solves the problem of extracting business rules in the inspection field and fully considers text semantic information and the differences in importance between words. By tagging relations as entities and extracting triples with a pattern-matching-based triple sequence extraction model for electric power inspection business rules, the invention effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and improves the accuracy of entity relation extraction.
Description
Technical Field
The invention relates to the technical field of electric power business inspection, in particular to an inspection business basic rule extraction method and device.
Background
Business rules are descriptions of business definitions and constraints used to maintain business structure or to control and influence business behavior. A business rule can also be understood as a set of conditions together with the operations performed under those conditions: a set of precise, condensed statements that describe, constrain, and control the structure, operation, and strategy of an enterprise, that is, a piece of business logic within an application. The underlying principle is simple: a set of conditions is defined, and when those conditions are satisfied, one or more actions are triggered.
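The condition-action principle described above can be sketched as a small data structure; the class name and the invoice example below are illustrative, not part of the invention.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class BusinessRule:
    """A business rule: a set of conditions that, when all satisfied, triggers actions."""
    name: str
    conditions: List[Callable[[Dict[str, Any]], bool]]
    actions: List[Callable[[Dict[str, Any]], None]]

    def evaluate(self, facts: Dict[str, Any]) -> bool:
        """Fire the actions if every condition holds for the given facts."""
        if all(cond(facts) for cond in self.conditions):
            for action in self.actions:
                action(facts)
            return True
        return False

# Illustrative audit rule: flag invoices above a threshold.
log: List[str] = []
rule = BusinessRule(
    name="flag_large_invoice",
    conditions=[lambda f: f.get("amount", 0) > 10000],
    actions=[lambda f: log.append("flagged invoice " + f["id"])],
)
rule.evaluate({"id": "INV-1", "amount": 25000})
rule.evaluate({"id": "INV-2", "amount": 500})
```

Only the first invoice satisfies the condition set, so only it triggers the action.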
In the field of natural language processing, information extraction has long attracted attention. It comprises three main subtasks: entity extraction, relation extraction, and event extraction, of which relation extraction is the core task and a key link. Business rule extraction involves both entity and relation extraction. The main objective of entity relation extraction is to identify and classify the specific relations that exist between entity pairs in natural language text; it provides basic support for intelligent retrieval, semantic analysis, and related tasks, helps improve search efficiency, and promotes the automatic construction of knowledge bases.
Unlike entity relation extraction in the general open domain, business rule extraction involves domain-specific knowledge. The text corpora describing business rules in the electric power inspection field are free of word-sense ambiguity and usually follow a fixed "entity-relation-entity" ordering, so entity relations are evident in the text. However, entity descriptions are complex, technical terms are numerous, and the relation types between entities are many and hard to enumerate and organize; annotators must have domain expertise, manual labeling is difficult and costly, and the accuracy requirements on the extraction results are strict. At present there is no published research on business rule extraction in the electric power inspection field.
The prior art is usually based on data from the open domain or from specific fields such as medicine; it pays no attention to knowledge of the electric power inspection business, cannot be transplanted directly, and lacks both a professional base dictionary and research on entity relation extraction for the electric power inspection field. Existing methods consider only the semantic information of words and cannot distinguish differences in word importance, while in practice the relations between business rule entities in the power inspection field are varied and complex, so the prior art cannot enumerate all relation types.
Disclosure of Invention
The invention provides a method and a device for extracting basic inspection business rules, which convert the relationship between entities into an entity type so that relations can be extracted directly from the text as entities, without being limited to known inspection business relations.
According to one aspect of the invention, an inspection business basic rule extraction method is provided, which comprises the following steps:
performing text preprocessing on the input data to obtain a processed input corpus, wherein the input data includes: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted;
constructing a word vector generation model for the electric power inspection field; obtaining the word vector of each word in the processed input text by weighting Word2vec embeddings with an improved term frequency-inverse document frequency (TF-IDF) model, and obtaining a word vector matrix for each text;
constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and iterating the model continuously to complete its construction and parameter tuning;
according to the trained rule extraction model, taking the word vector matrix of a text from the to-be-extracted corpus as input data and outputting an entity relation label for each word in the text;
and obtaining, according to the entity relation labels, the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted.
The text preprocessing comprises the following steps:
removing empty characters from the input data text;
replacing mid-sentence stops and sentence-final stops with blanks;
keeping the punctuation marks commonly used in official documents, keeping digits, letters, and letter case unchanged, and removing all other punctuation marks and special characters;
and segmenting the input text with a Chinese word segmentation tool, based on a general dictionary and an independently constructed professional base dictionary for the inspection field, to obtain the processed input corpus.
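A minimal sketch of these cleaning steps follows; the exact kept-character set is an assumption based on the rules above (the patent's full punctuation whitelist is not reproduced here), and the jieba calls shown in comments are the usual API for loading a custom dictionary.

```python
import re

def clean_text(text: str) -> str:
    """Cleaning rules sketched from the steps above: replace mid-sentence
    and sentence-final stops with blanks, keep common official-document
    punctuation plus digits and letters (case preserved), strip everything
    else, and collapse the resulting whitespace."""
    text = re.sub(r"[。；;]", " ", text)                          # stops -> blank
    text = re.sub(r"[^\w\s()（）《》\[\]【】:：,，.\-]", "", text)  # assumed keep-list
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace

# Segmentation would then use jieba with the domain dictionary, e.g.
#   jieba.load_userdict("inspection_domain_dict.txt")
#   tokens = jieba.lcut(clean_text(raw_text))
```

Note that Python's `\w` matches Unicode word characters, so Chinese characters survive the keep-list untouched.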
Obtaining the word vector of each word in the processed input corpus by weighting Word2vec embeddings with the improved term frequency-inverse document frequency method, and obtaining the word vector matrix of each text, comprises the following steps:
copying the processed corpus D and saving the copy as corpus D';
training a CBOW-based Word2Vec word embedding model on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, ..., e_N};
filtering the stop words out of corpus D using the established domain stop-word dictionary;
training the TF-IDF model on the stop-word-filtered corpus D to obtain the tfidf value of every word in D:

tfidf_{i,j} = tf_{i,j} × idf_i, with tf_{i,j} = n_{i,j} / Σ_k n_{k,j} and idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} is the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} is the total number of word occurrences in text d_j; |D| is the total number of texts in the corpus; and |{j : t_i ∈ d_j}| is the number of texts containing word t_i;
for a given text d_j in corpus D, initializing the tfidf value of each stop word to 0, then adding 1 to the tfidf value of every word in the text to obtain the weights of all words in the current text d_j, where w_{i,j} = tfidf_{i,j} + 1 is the weight of the i-th word;
inputting the text d_j into the trained Word2Vec word embedding model, mapping out the embeddings e_i, i = 1, 2, ..., n, of all words of the text, and multiplying each embedding by the weight w_{i,j} of the corresponding word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, ..., v_{n,j}}, v_{i,j} = w_{i,j} × e_i

where v_{i,j} is the final word vector of the i-th word of text d_j; w_{i,j} is the weight of the i-th word of text d_j; and e_i is the Word2Vec word embedding of the i-th word of text d_j;
and repeating the above steps for every text in corpus D' to obtain the word vector matrix corresponding to each text.
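The weighting scheme above, tfidf plus one with stop words zeroed first, can be sketched as follows; the corpus, stop-word list, and toy embedding table are illustrative, and a real implementation would take the embeddings from the trained Word2Vec model rather than a hand-written dict.

```python
import math
from typing import Dict, List, Sequence, Set

def tfidf_weights(docs: Sequence[List[str]], stopwords: Set[str]) -> List[Dict[str, float]]:
    """Per-text word weights w_{i,j} = tfidf_{i,j} + 1, with each stop word's
    tfidf initialised to 0, as in the scheme above (natural log assumed for idf)."""
    n_docs = len(docs)
    df: Dict[str, int] = {}                        # document frequency of non-stop words
    for doc in docs:
        for t in set(doc):
            if t not in stopwords:
                df[t] = df.get(t, 0) + 1
    weights: List[Dict[str, float]] = []
    for doc in docs:
        total = sum(1 for t in doc if t not in stopwords)  # sum_k n_{k,j} after filtering
        w: Dict[str, float] = {}
        for t in set(doc):
            if t in stopwords:
                tfidf = 0.0                        # stop words: importance zeroed
            else:
                tf = doc.count(t) / total
                tfidf = tf * math.log(n_docs / df[t])
            w[t] = tfidf + 1.0                     # +1 keeps stop-word vectors non-zero
        weights.append(w)
    return weights

docs = [["breaker", "the", "overload"], ["breaker", "inspect"]]
W = tfidf_weights(docs, stopwords={"the"})

# Weighting a toy 2-d embedding table (stand-in for Word2Vec output).
emb = {"breaker": [0.2, 0.4], "the": [0.1, 0.1], "overload": [0.5, 0.0]}
vec = {t: [w * x for x in emb[t]] for t, w in W[0].items()}
```

The stop word "the" ends up with weight exactly 1, so its vector is preserved rather than zeroed, which is the point of the +1 correction.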
Constructing the rule extraction model based on a multi-head self-attention mechanism comprises the following steps:
taking the word vector matrix of the labeled training text corpus as input data, and encoding the sequence by adding word vectors and position vectors; the position encoding uses the sin and cos scheme, and the summed result is fed into a simplified Transformer model to extract text features;
sending the text feature extraction result to a conditional random field (CRF) classifier and outputting the probability score of the label sequence:

s(W, y) = Σ_i M(y_i, y_{i+1}) + Σ_i P(i, y_i)

where W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is a predicted tag sequence; M is the state transition matrix; M(y_i, y_{i+1}) denotes the probability of transitioning from label y_i to label y_{i+1}; P(i, y_i) denotes the probability that the i-th word is assigned label y_i; and s(W, y) is the probability score of the input text W being predicted as the label sequence y;
and training the model with the Adam optimization algorithm during back propagation, continuously updating the parameters, to finally obtain the label prediction model; preferably, the negative log-likelihood is used as the model loss function, specifically:

Loss = -log( exp(s(W, y)) / Σ_{y'} exp(s(W, y')) )

where s(W, y') is the probability score of the input text W being predicted as a candidate label sequence y', and s(W, y) is the probability score of the input text W being predicted as the true tag sequence y.
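The CRF scoring formula, together with the Viterbi decoding that recovers the best-scoring label sequence from it, can be sketched in plain Python; the emission and transition values below are illustrative, and in the patent's model the emission scores would come from the Transformer features.

```python
from typing import List, Sequence, Tuple

def crf_score(emissions: Sequence[Sequence[float]],
              transitions: Sequence[Sequence[float]],
              tags: Sequence[int]) -> float:
    """s(W, y): sum of per-position emission scores P[i][y_i]
    plus transition scores M[y_i][y_{i+1}], as in the formula above."""
    s = sum(emissions[i][t] for i, t in enumerate(tags))
    s += sum(transitions[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    return s

def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Best tag sequence under the same score (standard Viterbi decoding)."""
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back: List[List[int]] = []
    for i in range(1, n):
        new_score, bp = [], []
        for t in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + transitions[p][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + emissions[i][t])
            bp.append(best_prev)
        score = new_score
        back.append(bp)
    best_last = max(range(k), key=lambda t: score[t])
    tags = [best_last]
    for bp in reversed(back):
        tags.append(bp[tags[-1]])
    tags.reverse()
    return tags, score[best_last]
```

By construction the score Viterbi returns equals `crf_score` evaluated on the decoded path.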
Using the word vector matrix of the labeled training text corpus as input data and continuously iterating the model to complete its construction and parameter tuning, and then applying the model, comprises:
according to the constructed business rule extraction model, taking the to-be-extracted text corpus as input data and outputting the entity relation label of each word of each text in the corpus;
taking the texts in the to-be-extracted corpus one by one and obtaining the corresponding word vector matrices;
and feeding each word vector matrix, as input data, into the trained multi-head self-attention rule extraction model for prediction, outputting the entity relation label of each word in the text to obtain the label sequence result of the current text.
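The sin and cos position encoding mentioned in the model-construction step is, by this description, the standard sinusoidal scheme; a sketch under that assumption:

```python
import math
from typing import List

def positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """Sinusoidal position encoding: sin on even dimensions, cos on odd,
    with the usual 10000^(i/d_model) wavelength schedule."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# The encoded input is word_vector + position_vector, position by position,
# before being fed to the simplified Transformer.
```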
Obtaining, according to the entity relation labels, the set of basic inspection business rule triples corresponding to each text comprises:
building a regular-expression-based triple extraction model for inspection business rules, extracting the relation triples that appear in the label sequence in entity-relation-entity order, outputting the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted, and generating the basic inspection business rules.
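The entity-relation-entity pattern match over the label sequence can be sketched as follows; the BIO-style tag names ENT and REL are illustrative, not the patent's actual tag set, and Chinese tokens would be concatenated without the space used here for readability.

```python
from typing import List, Sequence, Tuple

def extract_triples(tokens: Sequence[str], labels: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Collect BIO-labelled chunks, then emit a triple wherever the chunk
    sequence matches the entity-relation-entity pattern."""
    spans: List[Tuple[str, str]] = []          # (chunk type, chunk text)
    cur_type, cur_toks = None, []

    def flush():
        nonlocal cur_type, cur_toks
        if cur_type:
            spans.append((cur_type, " ".join(cur_toks)))
        cur_type, cur_toks = None, []

    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            flush()
            cur_type, cur_toks = lab[2:], [tok]
        elif lab.startswith("I-") and cur_type == lab[2:]:
            cur_toks.append(tok)
        else:
            flush()
    flush()

    triples = []
    for i in range(len(spans) - 2):
        if (spans[i][0], spans[i + 1][0], spans[i + 2][0]) == ("ENT", "REL", "ENT"):
            triples.append((spans[i][1], spans[i + 1][1], spans[i + 2][1]))
    return triples
```

Because the patent's texts follow the entity-relation-entity ordering, a sliding window over the chunk sequence is enough to recover the rule triples.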
According to another aspect of the present invention, there is provided an inspection business basic rule extracting device, including:
the preprocessing unit, configured to perform text preprocessing on input data to obtain a processed input corpus, wherein the input data includes: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted;
the word vector generation unit, configured to construct a word vector generation model for the electric power inspection field, obtain the word vector of each word in the processed input text by weighting Word2vec embeddings with an improved term frequency-inverse document frequency (TF-IDF) model, and obtain a word vector matrix for each text;
the business rule learning unit, configured to construct a rule extraction model based on a multi-head self-attention mechanism, use the word vector matrix of the labeled training text corpus as input data, and iterate the model continuously to complete its construction and parameter tuning;
the entity relation extraction unit, configured to take, according to the trained rule extraction model, the word vector matrix of a text from the to-be-extracted corpus as input data and output an entity relation label for each word in the text;
and the inspection business output unit, configured to obtain, according to the entity relation labels, the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted.
The preprocessing unit specifically includes:
the character processing subunit, configured to remove empty characters from the input data text;
the punctuation processing subunit, configured to replace mid-sentence stops and sentence-final stops with blanks;
the special character processing subunit, configured to keep the punctuation marks commonly used in official documents, keep digits, letters, and letter case unchanged, and remove all other punctuation marks and special characters;
and the word segmentation processing subunit, configured to segment the input text with a Chinese word segmentation tool, based on a general dictionary and an independently constructed professional base dictionary for the inspection field, to obtain the processed input corpus.
The word vector generation unit specifically includes:
the stop word processing subunit, configured to copy the processed corpus D and save the copy as corpus D', train a CBOW-based Word2Vec word embedding model on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, ..., e_N}, and filter the stop words out of corpus D using the established domain stop-word dictionary;
the model training subunit, configured to train the TF-IDF model on the stop-word-filtered corpus D to obtain the tfidf value of every word in D:

tfidf_{i,j} = tf_{i,j} × idf_i

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} is the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} is the total number of word occurrences in text d_j; |D| is the total number of texts in the corpus; and |{j : t_i ∈ d_j}| is the number of texts containing word t_i;
for a given text d_j in corpus D, the tfidf value of each stop word is initialized to 0; 1 is then added to the tfidf value of every word in the text, giving the weights of all words in the current text d_j, where w_{i,j} = tfidf_{i,j} + 1 is the weight of the i-th word;
the text d_j is input into the trained Word2Vec word embedding model, which maps out the embeddings e_i, i = 1, 2, ..., n, of all words of the text; each embedding is multiplied by the weight w_{i,j} of the corresponding word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, ..., v_{n,j}}, v_{i,j} = w_{i,j} × e_i

where v_{i,j} is the final word vector of the i-th word of text d_j; w_{i,j} is the weight of the i-th word of text d_j; and e_i is the Word2Vec word embedding of the i-th word of text d_j;
and the word vector matrix processing subunit, configured to repeat the above steps for every text in corpus D' to obtain the word vector matrix corresponding to each text.
The business rule learning unit specifically includes:
the text feature extraction subunit, configured to take the word vector matrix of the labeled training text corpus as input data and encode the sequence by adding word vectors and position vectors; the position encoding uses the sin and cos scheme, and the summed result is fed into a simplified Transformer model to extract text features;
the probability prediction subunit, configured to send the text feature extraction result to the conditional random field (CRF) classifier and output the probability score of the label sequence:

s(W, y) = Σ_i M(y_i, y_{i+1}) + Σ_i P(i, y_i)

where W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is a predicted tag sequence; M is the state transition matrix; M(y_i, y_{i+1}) denotes the probability of transitioning from label y_i to label y_{i+1}; P(i, y_i) denotes the probability that the i-th word is assigned label y_i; and s(W, y) is the probability score of the input text W being predicted as the label sequence y;
and the probability obtaining subunit, configured to train the model with the Adam optimization algorithm during back propagation, continuously updating the parameters, to finally obtain the label prediction model; preferably, the negative log-likelihood is used as the model loss function, specifically:

Loss = -log( exp(s(W, y)) / Σ_{y'} exp(s(W, y')) )

where s(W, y') is the probability score of the input text W being predicted as a candidate label sequence y', and s(W, y) is the probability score of the input text W being predicted as the true tag sequence y.
By adopting the above technical scheme, the invention provides a basic inspection business rule extraction scheme that constructs a professional base dictionary and a word vector generation model for the electric power inspection field, fully considers the characteristics of business rules in the inspection field and the differences in importance between words, and converts the relationship between entities into an entity type, so that relations are extracted directly from the text as entities rather than being limited to known inspection business relations.
The embodiment of the invention effectively solves the problem of extracting business rules in the inspection field and fully considers text semantic information and the differences in importance between words. By tagging relations as entities and using a pattern-matching-based triple sequence extraction model for electric power inspection business rules, it effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and improves the accuracy of entity relation extraction.
The scheme establishes a professional base dictionary for the electric power inspection field. During text cleaning, cleaning rules tailored to the characteristics of text corpora in this specific field are applied: the punctuation marks commonly used in official documents are kept, and digits, letters, and letter case are left unchanged. When computing word weights, the tfidf values of all non-stop words are calculated first and the tfidf value of every stop word is initialized to 0; then 1 is added to the tfidf value of every word in the text to obtain the word weights, and the weight of each word is multiplied by its Word2Vec word embedding, fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words, this approach accounts for the influence of stop words by reducing their tfidf values, i.e. their importance, to the minimum, while adding 1 to every tfidf value prevents the weighted word vectors of stop words from collapsing to zero vectors.
The invention builds a word vector generation model for the electric power inspection field that fuses semantic information with word importance: given an input text, it automatically maps out the word vector matrix, a first in the electric power inspection field. Given that the texts describing electric power inspection business rules are relatively standardized, strongly regular, and contain non-fixed relations, the invention proposes tagging relations as entities, which effectively avoids the limitation of traditional relation extraction methods, in which the classification model must enumerate relation types in advance, and constructs a pattern-matching-based triple sequence extraction model for electric power inspection business rules, realizing the effective extraction of business rule triples.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating basic rules extraction for inspection services according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an inspection business basic rule extraction method based on attention mechanism according to an embodiment of the present invention;
FIG. 3 is a model structure diagram of an inspection business basic rule extraction method based on attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an inspection business basic rule extracting device in an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings; it should be understood that they are described here only to illustrate and explain the invention, not to limit it.
The attention-mechanism-based inspection business basic rule extraction method of the embodiment of the invention uses the following input data: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted; the output result is the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted.
In the embodiment of the invention, information extraction is primarily a technique for automatically extracting specific information from large amounts of text data, e.g. for populating a database. An entity is something distinguishable that exists independently, though not necessarily physically: abstractions and legal constructs are also commonly considered entities. Relationships are explicit or implicit semantic connections between entities. Word2Vec is a family of related models used to generate word vectors; after training, each word can be mapped to a vector that captures word-to-word relationships. Word embedding is a collective term for the language models and representation learning techniques in natural language processing (NLP) that embed a high-dimensional space, with one dimension per word, into a continuous vector space of much lower dimension, mapping each word or phrase to a real-valued vector. The Transformer is a classic NLP model proposed by a Google team in 2017; it uses a self-attention mechanism instead of the sequential structure of an RNN, so it can be trained in parallel and can capture global information.
FIG. 1 is a flowchart illustrating an inspection business basic rule extraction process according to an embodiment of the present invention. As shown in fig. 1, the inspection business basic rule extraction process includes the following steps:
In an embodiment of the invention, the input data includes: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted.
In the embodiment of the invention, a set of professional basic dictionaries in the electric power inspection field and text cleaning rules aiming at the characteristics of business text corpora in the electric power inspection field are constructed. All input data are cleaned and participled, and the detailed steps are as follows:
removing empty characters in the text;
replacing the dot number and the end dot number (including;) in the common sentence as a blank;
the punctuation marks (including < lambda > in </lambda >) (< lambda >) commonly used in official documents are reserved, the capital and lower case formats of numbers, letters and letters are kept unchanged, and other punctuation marks and special characters are removed;
based on a general dictionary and an independently constructed basic dictionary of the inspection field specialty, a jieba Chinese word segmentation tool is used for segmenting words of an input text to obtain a cooked corpus D, and word segmentation results before and after the reference of the basic dictionary of the inspection field specialty are shown as the following table 1:
Table 1: comparison of word segmentation results before and after referencing the inspection-field dictionary
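The cleaning and segmentation steps above can be sketched as follows. The punctuation whitelist, the toy dictionary entries, and the fallback segmenter are illustrative assumptions; the patent itself uses jieba loaded with the custom inspection-domain dictionary, and forward maximum matching here only stands in to show how a domain dictionary changes the result.

```python
import re

# Punctuation assumed to be retained for official documents (an assumption;
# the patent's exact whitelist is not reproduced in the text).
KEEP = "《》()()-"

def clean_text(text):
    text = re.sub(r"\s+", "", text)          # remove empty characters
    text = re.sub(r"[、,,;;。]", " ", text)    # sentence punctuation -> blank
    allowed = re.compile(r"[\u4e00-\u9fffA-Za-z0-9 " + re.escape(KEEP) + r"]")
    return "".join(ch for ch in text if allowed.match(ch))

def fmm_segment(text, dictionary, max_len=6):
    """Forward maximum matching: greedily take the longest dictionary word,
    falling back to single characters (a stand-in for jieba's behaviour)."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in dictionary:
                words.append(text[i:i + l])
                i += l
                break
    return words

generic = {"变压"}                    # hypothetical generic dictionary entries
domain = generic | {"变压器套管"}      # domain dictionary adds the full term
```

With only the generic dictionary, "变压器套管" (transformer bushing) fragments into pieces; adding the domain term keeps it whole, which is the effect Table 1 illustrates.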
102: constructing a power inspection field word vector generation model, and obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved term frequency-inverse document frequency (TF-IDF) model, thereby obtaining a word vector matrix for each text.
In the embodiment of the invention, a word vector generation model for the electric power inspection field that fuses semantic information and importance degree is established; given an input text, its word vector matrix can be mapped automatically. A vectorized representation of each text is obtained through the word vector generation module.
In the embodiment of the invention, a word vector generation model for the electric power inspection field is constructed, and the word vector of each word in a text is obtained by weighting Word2Vec embeddings with an improved TF-IDF (term frequency-inverse document frequency) model, yielding a word vector matrix for each text. The detailed steps are as follows:
The word-segmented corpus D is copied and saved as corpus D'.
A CBOW-based Word2Vec word embedding model is trained on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, …, e_N}. Considering that the corpus D' is not large, the word embedding dimension is set to the commonly used 128.
Stop words are filtered from corpus D using the established domain stop-word dictionary, removing them from the corpus.
The TF-IDF model is trained on the stop-word-filtered corpus D to obtain the tfidf value of every word in corpus D, calculated as:

tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i for text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the total number of texts containing word t_i.
For a given text d_j in corpus D, its words comprise non-stop words, which have tfidf values, and stop words, which do not. The stop words in the text are initialized with a tfidf value of 0, reducing their influence on the text to a minimum; then 1 is added to the tfidf value of every word in the text, avoiding the situation where the word vectors of stop words become zero vectors after weighting. This yields the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, …, w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight, i.e., the importance degree, of the i-th word. This step provides an improved tfidf calculation that effectively reduces the influence of stop words on the text and avoids the zero-vector situation.
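The improved weighting can be sketched as below: stop words get tfidf = 0, then every word's weight is tfidf + 1, so no word vector collapses to zero after weighting. The toy corpus and the stop-word list are assumptions for illustration.

```python
import math

def improved_weights(doc, corpus, stopwords):
    """Return w_{i,j} = tfidf_{i,j} + 1 for each token of doc (a list of words)."""
    n = len(doc)
    weights = []
    for t in doc:
        if t in stopwords:
            tfidf = 0.0                            # stop word: minimal influence
        else:
            tf = doc.count(t) / n                  # tf = n_{i,j} / sum_k n_{k,j}
            df = sum(1 for d in corpus if t in d)  # |{j : t_i in d_j}|
            tfidf = tf * math.log(len(corpus) / df)
        weights.append(tfidf + 1.0)                # the +1 offset avoids zeros
    return weights

corpus = [["巡检", "的", "设备"], ["设备", "正常"]]
w = improved_weights(["巡检", "的", "设备"], corpus, stopwords={"的"})
```

Here the stop word "的" and the corpus-wide word "设备" both receive the floor weight 1.0, while the distinctive word "巡检" receives a weight above 1.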
Text d_j is fed into the trained Word2Vec word embedding model, mapping each of its words to a word embedding representation e_i, i = 1, 2, 3, …, n. Each word embedding is then multiplied by the corresponding weight w_{i,j} obtained in step s205 to produce the final word vector matrix of the text, calculated as:
V_j = {v_{1,j}, v_{2,j}, …, v_{n,j}}

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j, w_{i,j} denotes the weight of the i-th word in text d_j, and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j.
The above operations are repeated for all texts in corpus D', yielding the word vector matrix corresponding to each text.
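The weighting step v_{i,j} = w_{i,j} × e_i amounts to scaling each word's embedding by its weight; the scaled vectors stacked together form the text's word vector matrix V_j. A minimal sketch, with toy 3-dimensional embeddings assumed for illustration:

```python
def word_vector_matrix(embeddings, weights):
    """Scale each word embedding e_i by its weight w_{i,j} to get v_{i,j}."""
    return [[w * x for x in e] for e, w in zip(embeddings, weights)]

# Toy values: two words, 3-dimensional embeddings (illustrative assumptions).
E = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]   # e_1, e_2
W = [2.0, 1.0]                            # w_{1,j}, w_{2,j}
V = word_vector_matrix(E, W)              # the text's word vector matrix V_j
```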
103: constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrices of the labeled training text corpora as input data, and iterating the model until model building and parameter tuning are complete.
In the embodiment of the invention, according to the constructed business rule extraction model, the text corpus of the rule to be extracted is taken as input data, and the entity relationship label of each word of each text in the text corpus is output.
In the embodiment of the invention, the entity relationship label of each word in the text is obtained through the business rule extraction module. In this step, a rule extraction model based on a multi-head self-attention mechanism is constructed. During model construction, the word vector matrices of the labeled training text corpora are used as input data, and the model is iterated until model building and parameter tuning are complete. The trained model is then embedded in the business rule extraction module: the word vector matrix of one text from the corpus of rules to be extracted is taken as input data, and after calculation the entity relationship label of each word in that text is output.
In the embodiment of the invention, the detailed construction steps of the rule extraction model based on the multi-head self-attention mechanism are as follows:
The word vector matrix of the labeled training text corpus is taken as input data, and a position vector encoding is added to it to express word order; the position encoding uses the sin/cos scheme. The summed result is fed into a simplified Transformer model to extract text features. Considering the limited amount of labeled data, and to reduce the risk of overfitting, the invention builds a simplified Transformer model consisting of 2 encoders and 2 decoders for feature extraction.
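The sin/cos position encoding named above can be sketched as follows; the formulation assumed here follows the original Transformer paper (even dimensions use sin, odd dimensions use cos), which the patent does not spell out.

```python
import math

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(word_vectors, pe):
    """Element-wise sum of word vectors and position codes: the Transformer input."""
    return [[v + p for v, p in zip(row, code)]
            for row, code in zip(word_vectors, pe)]
```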
The output is sent to a conditional random field (CRF) classifier, which outputs the probability score of the label sequence, calculated as:

S(W, y) = Σ_i M_{y_i, y_{i+1}} + Σ_i P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted label sequence; M is the state transition matrix, with M_{y_i, y_{i+1}} denoting the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and S(W, y) is the probability score of input text W being predicted as label sequence y.
In the back-propagation process, the Adam optimization algorithm is used to train the model and continuously update the parameters, finally yielding the label prediction model. A negative log-likelihood function is used as the model loss function:

Loss = log Σ_{ỹ} exp(S(W, ỹ)) − S(W, y)

where S(W, ỹ) is the probability score of input text W being predicted as a candidate label sequence ỹ, and S(W, y) is the probability score of input text W being predicted as the true label sequence y.
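The negative log-likelihood above can be sketched by computing the partition term with brute-force enumeration of all label sequences; this is only viable for tiny examples, and a real CRF implementation would use the forward algorithm instead (an assumption of this sketch, not the patent's stated method).

```python
import itertools
import math

def crf_nll(emissions, transitions, gold):
    """-log p(y | W) = log sum_y~ exp(S(W, y~)) - S(W, y), by enumeration."""
    def score(tags):
        s = sum(emissions[i][t] for i, t in enumerate(tags))
        return s + sum(transitions[tags[i]][tags[i + 1]]
                       for i in range(len(tags) - 1))

    num_tags = len(transitions)
    n = len(emissions)
    log_z = math.log(sum(math.exp(score(seq))
                         for seq in itertools.product(range(num_tags), repeat=n)))
    return log_z - score(gold)

# Toy scores (illustrative values); the gold sequence is the highest-scoring one,
# so the loss is small but strictly positive.
loss = crf_nll([[1.0, 0.0], [0.0, 2.0]], [[0.5, 1.5], [0.0, 0.0]], (0, 1))
```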
104: according to the trained rule extraction model, taking the word vector matrix of one text from the corpus of rules to be extracted as input data, and outputting the entity relationship label of each word in the text.
In the embodiment of the invention, after the rule extraction model based on a multi-head self-attention mechanism is constructed, it forms a complete business rule extraction module; passing input data through this module yields the text's label sequence result. The detailed steps are as follows:
acquiring word vector matrixes corresponding to all texts in the text corpus of the rule to be extracted one by one;
taking the word vector matrix as input data, sending it into the trained rule extraction model based on a multi-head self-attention mechanism for prediction, and outputting the entity relationship label of each word in the text, thereby obtaining the label sequence result of the current text.
105: obtaining the inspection business basic rule triple set corresponding to each rule text to be extracted according to the entity relationship labels.
In the embodiment of the invention, the basic rules of the inspection business are output. Considering that descriptions of inspection business rules are expressed in a uniform manner, so that the corresponding tag sequences exhibit clear regularity, an inspection business rule triple extraction model based on rule expressions is established. The relation triples in the text are extracted from the tag sequence output in step s312 in "entity-relation-entity" order, and finally the inspection business basic rule triple set corresponding to each rule text to be extracted is output, generating the inspection business basic rules.
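The entity-relation-entity scan can be sketched as below. The label names 'E' (entity), 'R' (relation), and 'O' (other) are illustrative assumptions, since the patent does not name its tag set.

```python
def extract_triples(words, labels):
    """Scan the predicted label sequence and emit (entity, relation, entity)
    triples in order of appearance, as the rule-expression model describes."""
    # Keep only entity and relation tokens, preserving order.
    items = [(l, w) for w, l in zip(words, labels) if l in ("E", "R")]
    triples, i = [], 0
    while i <= len(items) - 3:
        (k1, w1), (k2, w2), (k3, w3) = items[i:i + 3]
        if (k1, k2, k3) == ("E", "R", "E"):
            triples.append((w1, w2, w3))
            i += 3            # consume the matched entity-relation-entity window
        else:
            i += 1
    return triples

# Toy tagged sentence (words and tags are illustrative assumptions).
triples = extract_triples(["变压器", "巡视", "套管"], ["E", "R", "E"])
```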
Referring to fig. 2, a flow chart of an inspection business basic rule extraction method based on attention mechanism is shown. FIG. 3 is a model structure diagram of an inspection business basic rule extraction method based on attention mechanism.
The embodiment of the invention constructs a professional basic dictionary for the electric power inspection field and, in the text cleaning process, adopts cleaning rules tailored to the characteristics and patterns of text corpora in this specific field: punctuation marks common in official documents (e.g. 《》, parentheses, and hyphens) are retained, and numbers, letters, and letter case are kept unchanged.
When calculating the weights of words in a text, the tfidf values of all non-stop words are computed first, the tfidf value of each stop word is initialized to 0, and then 1 is added to the tfidf value of every word in the text to obtain its weight; the weight of every word is then multiplied by its Word2Vec word embedding representation, fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words, this approach accounts for the influence of stop words by reducing their tfidf value, i.e., their importance, to the minimum, and the added 1 prevents the word vectors of stop words from becoming zero vectors after weighting.
The embodiment of the invention builds a word vector generation model for the electric power inspection field that fuses semantic information and importance degree; given an input text, its word vector matrix can be mapped automatically, a first in the electric power inspection field.
Given that descriptions of electric power inspection business rules are expressed in a relatively standard way, exhibit strong regularity, and involve relations that are not fixed in advance, the embodiment of the invention proposes labeling relations as entities. This effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and a pattern-matching-based electric power inspection business rule triple sequence extraction model is constructed, realizing effective extraction of business rule triples.
In order to implement the above process, the technical solution of the present invention further provides an inspection business basic rule extracting device, as shown in fig. 4, the inspection business basic rule extracting device includes:
the preprocessing unit 21 is configured to perform text preprocessing on input data to obtain an input corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
the word vector generation unit 22 is used for constructing a word vector generation model for the electric power inspection field, obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved term frequency-inverse document frequency TF-IDF model, and obtaining a word vector matrix for each text;
the business rule learning unit 23 is configured to construct a rule extraction model based on a multi-head self-attention mechanism, and continuously iterate the model by using the word vector matrix of the labeled training text corpus as input data to complete the establishment and parameter tuning of the model;
the entity relationship extraction unit 24 is configured to extract a model according to the trained rule, take a word vector matrix of one text in a text corpus of the rule to be extracted as input data, and output an entity relationship label of each word in the text;
and the inspection service output unit 25 is configured to obtain an inspection service basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship tag.
The preprocessing unit 21 specifically includes:
the character processing subunit is used for removing empty characters in the input data text;
the punctuation processing subunit is used for replacing enumeration commas within sentences and sentence-final periods with blanks;
the special character processing subunit is used for keeping the common punctuation marks of the official documents, keeping the numbers, the letters and the capital and lower case formats of the letters unchanged and removing other punctuation marks and special characters;
and the word segmentation processing subunit is used for segmenting the input text by using a Chinese word segmentation tool based on the general dictionary and the independently constructed professional basic dictionary in the inspection field to obtain the input cooked linguistic data.
The word vector generating unit 22 specifically includes:
a stop word processing subunit, configured to copy and save a copy of the input cooked corpus D as corpus D', train a CBOW-based Word2Vec word embedding model using corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, …, e_N}, and filter stop words from corpus D using the established domain stop-word dictionary to remove them from the corpus;
and the model training subunit is used for training the TF-IDF model by using the corpus D which is subjected to stop word filtering to obtain tfidf values of all words in the corpus D:
tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i for text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the total number of texts containing word t_i;

for a given text d_j in corpus D, the tfidf values of its stop words are initialized to 0; 1 is added to the tfidf values of all words in the text to obtain the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, …, w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
feeding text d_j into the trained Word2Vec word embedding model to map each of its words to a word embedding representation e_i, i = 1, 2, 3, …, n, and multiplying each word embedding by the weight w_{i,j} corresponding to that word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, …, v_{n,j}}

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word in text d_j; and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j;
and the word vector matrix processing subunit is used for repeatedly executing the steps on all the texts in the corpus D' to obtain a word vector matrix corresponding to each text.
The business rule learning unit 23 specifically includes:
the text feature extraction subunit is used for taking the word vector matrix of the labeled training text corpus as input data and adding position vector encodings to the word vectors to express word order; the position encoding uses the sin/cos scheme, and the summed result is fed into a simplified Transformer model to extract text features;
the probability prediction subunit is used for sending the text feature extraction result to the conditional random field CRF classifier and outputting the probability score of the label sequence:

S(W, y) = Σ_i M_{y_i, y_{i+1}} + Σ_i P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted label sequence; M is the state transition matrix, with M_{y_i, y_{i+1}} denoting the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and S(W, y) is the probability score of input text W being predicted as label sequence y;
the probability obtaining subunit is used for training the model with the Adam optimization algorithm during back propagation, continuously updating the parameters, and finally obtaining the label prediction model; preferably, a negative log-likelihood function is used as the model loss function, specifically:

Loss = log Σ_{ỹ} exp(S(W, ỹ)) − S(W, y)

where S(W, ỹ) is the probability score of input text W being predicted as a candidate label sequence ỹ, and S(W, y) is the probability score of input text W being predicted as the true label sequence y.
In summary, the technical solution of the present invention provides an inspection business basic rule extraction solution that constructs a professional basic dictionary for the electric power inspection field and a word vector generation model, fully considers the characteristics of business rules in the inspection field and the importance differences between words, and, by converting relations between entities into an entity type, extracts relations directly from text as entities without being limited to known inspection business relations.
The embodiment of the invention effectively solves the problem of extracting business rules in the inspection field and fully considers text semantic information and the importance differences between words. By entity-labeling relations and using the pattern-matching-based electric power inspection business rule triple sequence extraction model, it effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and improves the accuracy of entity relation extraction.
According to the scheme, a professional basic dictionary for the electric power inspection field is established, and in the text cleaning process, cleaning rules tailored to the characteristics and patterns of text corpora in this specific field are adopted: punctuation marks common in official documents (e.g. 《》, parentheses, and hyphens) are retained, and numbers, letters, and letter case are kept unchanged. When calculating word weights, the tfidf values of all non-stop words are computed first, the tfidf value of each stop word is initialized to 0, and then 1 is added to the tfidf value of every word in the text to obtain its weight; the weight of every word is then multiplied by its Word2Vec word embedding representation, fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words, this approach accounts for the influence of stop words by reducing their tfidf value, i.e., their importance, to the minimum, and the added 1 prevents the word vectors of stop words from becoming zero vectors after weighting.
The invention builds a word vector generation model for the electric power inspection field that fuses semantic information and importance degree; given an input text, its word vector matrix can be mapped automatically, a first in the electric power inspection field. Given that descriptions of electric power inspection business rules are expressed in a relatively standard way, exhibit strong regularity, and involve relations that are not fixed in advance, the invention proposes entity-labeling relations, effectively avoiding the limitation of traditional relation extraction methods in which a classification model must enumerate the relation types in advance, and constructs a pattern-matching-based electric power inspection business rule triple sequence extraction model, realizing effective extraction of business rule triples.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. An inspection business basic rule extraction method is characterized by comprising the following steps:
performing text preprocessing on input data to obtain an input cooked corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
constructing a power inspection field word vector generation model; obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved word frequency-inverse document frequency TF-IDF model, and obtaining a word vector matrix for each text;
constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and continuously iterating the model to complete the establishment and parameter tuning of the model;
according to the trained rule extraction model, taking a word vector matrix of a text in the text corpus of the rule to be extracted as input data, and outputting an entity relation label of each word in the text;
and obtaining an inspection business basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label.
2. The method of claim 1, wherein the text preprocessing comprises:
removing empty characters in the input data text;
replacing enumeration commas within sentences and sentence-final periods with blanks;
keeping the common punctuation marks of the official documents, keeping the numbers, the letters and the capital and lower case formats of the letters unchanged, and removing other punctuation marks and special characters;
based on a general dictionary and an independently constructed professional basic dictionary in the inspection field, a Chinese word segmentation tool is used for segmenting words of an input text to obtain an input cooked corpus.
3. The method as claimed in claim 1, wherein the obtaining of the Word vector of each Word in the input corpus text based on Word2vec and the modified Word frequency-inverse document frequency method weighting comprises:
copying and saving a copy of the input cooked corpus D as corpus D';
training a CBOW-based Word2Vec word embedding model using corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, …, e_N};
Filtering stop words of the corpus D by using the established domain stop dictionary to remove the stop words in the corpus;
training the TF-IDF model using the stop-word-filtered corpus D to obtain the tfidf value of every word in corpus D:
tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i for text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the total number of texts containing word t_i;

for a given text d_j in corpus D, the tfidf values of its stop words are initialized to 0; 1 is added to the tfidf values of all words in the text to obtain the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, …, w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
feeding the text d_j into the trained Word2Vec word embedding model to map each of its words to a word embedding representation e_i, i = 1, 2, 3, …, n; multiplying each word embedding by the weight w_{i,j} corresponding to that word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, …, v_{n,j}}

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word in text d_j; and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j;
and repeating the steps on all texts in the corpus D' to obtain a word vector matrix corresponding to each text.
4. The method as claimed in claim 1, wherein the rule extraction model based on the multi-head self-attention mechanism is constructed as follows:
taking the word vector matrix of the labeled training text corpus as input data, and coding the word vector and the position vector to express the sequence; the position coding adopts a sin and cos calculation mode, the final added result is sent into a simplified Transformer model, and text features are extracted;
sending the text feature extraction result to a conditional random field CRF classifier, and outputting the probability score of the label sequence:
S(W, y) = Σ_i M_{y_i, y_{i+1}} + Σ_i P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted label sequence; M is the state transition matrix, with M_{y_i, y_{i+1}} denoting the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and S(W, y) is the probability score of input text W being predicted as label sequence y;

training the model with the Adam optimization algorithm during back propagation and continuously updating the parameters to finally obtain the label prediction model; preferably, using a negative log-likelihood function as the model loss function, specifically:

Loss = log Σ_{ỹ} exp(S(W, ỹ)) − S(W, y)
5. The method as claimed in claim 1, wherein said continuously iterating the model to complete the model building and parameter tuning using the word vector matrix of the labeled training text corpus as input data comprises:
according to the constructed business rule extraction model, taking a text corpus of rules to be extracted as input data, and outputting an entity relationship label of each word of each text in the text corpus;
acquiring all texts in the text corpus of the rule to be extracted one by one to obtain the corresponding word vector matrix;
and the word vector matrix is used as input data and is sent into a trained rule extraction model based on a multi-head self-attention mechanism for prediction, and the entity relation label of each word in the text is output to obtain the label sequence result of the current text.
6. The method as claimed in claim 1, wherein the obtaining of the inspection business basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label includes:
establishing an inspection service rule triple extraction model based on a rule expression, extracting relation triples in the text in an entity-relation-entity sequence in the tag sequence, outputting an inspection service basic rule triple set corresponding to the text of each rule to be extracted, and generating an inspection service basic rule.
7. An inspection business basic rule extracting device, comprising:
the preprocessing unit is used for performing text preprocessing on input data to obtain an input cooked corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
the word vector generation unit is used for constructing a word vector generation model for the electric power inspection field; obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved word frequency-inverse document frequency TF-IDF model, and obtaining a word vector matrix for each text;
the business rule learning unit is used for constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and continuously iterating the model to complete the establishment and parameter tuning of the model;
the entity relationship extraction unit is used for extracting a model according to the trained rule, taking a word vector matrix of one text in the text corpus of the rule to be extracted as input data, and outputting an entity relationship label of each word in the text;
and the inspection service output unit is used for obtaining an inspection service basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label.
8. The inspection business basic rule extraction device of claim 7, wherein the preprocessing unit specifically comprises:
a character processing subunit, configured to remove blank characters from the input data text;
a period processing subunit, configured to replace periods within common sentences and periods at the ends of sentences with blanks;
a special character processing subunit, configured to retain common official-document punctuation marks, keep numbers, letters and letter case unchanged, and remove other punctuation marks and special characters;
and a word segmentation processing subunit, configured to segment the input text with a Chinese word segmentation tool based on a general dictionary and a self-constructed professional basic dictionary for the inspection field, to obtain the processed input corpus.
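A rough sketch of these four sub-steps is given below. The exact character classes kept or dropped are assumptions, and real segmentation would use a dictionary-based Chinese segmenter (e.g. jieba loaded with a general dictionary plus the domain dictionary); a whitespace split stands in for it here.

```python
import re

def preprocess(text):
    """Sketch of the cleaning sub-steps; the kept punctuation set is assumed."""
    text = re.sub(r"\s+", " ", text)       # 1. remove/collapse blank characters
    text = re.sub(r"[.。]+", " ", text)     # 2. replace sentence periods with blanks
    # 3. keep common document punctuation, digits and letters (case preserved);
    #    drop other symbols and special characters
    text = re.sub(r"[^\w\s,;:()，；：（）、]", "", text)
    return text.strip()

def segment(text):
    """4. placeholder segmenter; a real system calls a Chinese word
    segmentation tool with general + domain dictionaries."""
    return text.split()
```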
9. The inspection business basic rule extraction device of claim 7, wherein the word vector generation unit specifically comprises:
a stop word processing subunit, configured to copy and store a copy of the processed input corpus D as corpus D', train a CBOW-based Word2Vec word embedding model with corpus D', and obtain a word embedding representation E = {e_1, e_2, ..., e_N} carrying semantic information; and to filter corpus D with the established domain stop-word dictionary to remove the stop words in the corpus;
a model training subunit, configured to train the TF-IDF model with the stop-word-filtered corpus D to obtain the tfidf value of each word in corpus D:

tfidf_{i,j} = tf_{i,j} × idf_i, where tf_{i,j} = n_{i,j} / Σ_k n_{k,j} and idf_i = log( |D| / |{j : t_i ∈ d_j}| )

wherein tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the number of texts containing word t_i;
for a certain text d_j in corpus D, the tfidf values of its stop words are initialised to 0; 1 is then added to the tfidf value of every word in the text to obtain the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, ..., w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
the text d_j is input into the trained Word2Vec word embedding model, which maps every word of the text to its word embedding representation e_i, i = 1, 2, 3, ..., n; each word embedding is multiplied by the weight w_{i,j} of the corresponding word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, ..., v_{n,j}}, where v_{i,j} = w_{i,j} × e_i

wherein v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word of text d_j; and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j;
and a word vector matrix processing subunit, configured to repeat the above steps for every text in corpus D' to obtain the word vector matrix corresponding to each text.
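The weighting scheme above can be sketched as follows. Toy embeddings stand in for the trained Word2Vec model, and idf is taken as log(|D|/df) since the claim does not spell out a smoothing variant.

```python
import math
import numpy as np

def tfidf_weights(corpus, stopwords=frozenset()):
    """Per-document weights w_{i,j} = tfidf_{i,j} + 1; stop words have
    their tfidf initialised to 0, so they end up with weight 1."""
    N = len(corpus)
    df = {}                                # document frequency of each word
    for doc in corpus:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in corpus:
        w = {}
        for t in set(doc):
            if t in stopwords:
                tfidf = 0.0
            else:
                tf = doc.count(t) / len(doc)       # n_{i,j} / sum_k n_{k,j}
                idf = math.log(N / df[t])          # log(|D| / df), assumed form
                tfidf = tf * idf
            w[t] = tfidf + 1.0
        weights.append(w)
    return weights

def doc_matrix(doc, weights, embed):
    """Final word vectors v_{i,j} = w_{i,j} * e_i, stacked into a matrix."""
    return np.stack([weights[t] * embed[t] for t in doc])
```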
10. The inspection business basic rule extraction device of claim 7, wherein the business rule learning unit specifically comprises:
a text feature extraction subunit, configured to take the word vector matrices of the labeled training text corpora as input data and represent each sequence by encoding its word vectors and position vectors; the positional encoding is calculated with sin and cos functions, and the sum of the two encodings is fed into a simplified Transformer model to extract text features;
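The sin/cos positional encoding can be sketched as below. The claim only states that sin and cos are used; the Vaswani-style 10000^(2i/d) frequencies are an assumption, and the resulting matrix is added to the word vector matrix before the Transformer.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding: sin on even dimensions,
    cos on odd dimensions (assumed standard form)."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model)[None, :]                # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# the encoded input would then be: word_vector_matrix + positional_encoding(n, d)
```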
a probability prediction subunit, configured to send the text feature extraction result to a conditional random field (CRF) classifier and output the probability score of the tag sequence:

s(W, y) = Σ_{i} M_{y_i, y_{i+1}} + Σ_{i} P_{i, y_i}

wherein W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is the predicted tag sequence; M is the state transition matrix; M_{y_i, y_{i+1}} denotes the probability of transferring from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and s(W, y) is the probability score of the input text W being predicted as the tag sequence y;
and a probability obtaining subunit, configured to train the model with the Adam optimization algorithm during back propagation, continuously updating the parameters to finally obtain the tag prediction model; preferably, the negative log-likelihood function is used as the model loss function, specifically:

loss = -log( e^{s(W, y)} / Σ_{y'} e^{s(W, y')} ) = log Σ_{y'} e^{s(W, y')} - s(W, y)
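A brute-force sketch of the CRF score s(W, y) and the negative log-likelihood loss is given below. Here P holds per-word label scores and M the transition scores; start/stop transitions are omitted, and the partition function is enumerated directly, whereas a real CRF layer computes it with the forward algorithm.

```python
import numpy as np
from itertools import product

def score(P, M, y):
    """s(W, y) = sum_i P[i, y_i] + sum_i M[y_i, y_{i+1}]
    (start/stop transitions omitted for brevity)."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(M[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def nll_loss(P, M, y):
    """loss = log sum_{y'} e^{s(W, y')} - s(W, y), enumerating all k^n
    tag sequences y' (feasible only for tiny n and k)."""
    n, k = P.shape
    log_z = np.log(sum(np.exp(score(P, M, list(yp)))
                       for yp in product(range(k), repeat=n)))
    return log_z - score(P, M, y)
```

Minimising this loss pushes the score of the gold tag sequence up relative to all alternatives, which is what the Adam updates in the claim achieve.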
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111179406.7A CN113901218A (en) | 2021-10-08 | 2021-10-08 | Inspection business basic rule extraction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113901218A true CN113901218A (en) | 2022-01-07 |
Family
ID=79190945
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111179406.7A Pending CN113901218A (en) | 2021-10-08 | 2021-10-08 | Inspection business basic rule extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901218A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN114648316A (*) | 2022-05-18 | 2022-06-21 | State Grid Zhejiang Electric Power Co., Ltd. | Digital processing method and system based on inspection tag library
CN114648316B (*) | 2022-05-18 | 2022-08-23 | State Grid Zhejiang Electric Power Co., Ltd. | Digital processing method and system based on inspection tag library
CN117909492A (*) | 2024-03-19 | 2024-04-19 | State Grid Shandong Electric Power Company Information and Telecommunication Company | Method, system, equipment and medium for extracting unstructured information of power grid
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||