CN113901218A - Inspection business basic rule extraction method and device - Google Patents

Inspection business basic rule extraction method and device

Info

Publication number
CN113901218A
CN113901218A (application CN202111179406.7A)
Authority
CN
China
Prior art keywords
text
word
corpus
model
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111179406.7A
Other languages
Chinese (zh)
Inventor
赵郭燚
王宗伟
金鹏
卜晓阳
姜冬
苏媛
武鹏
刘明明
董玉璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Co ltd Customer Service Center
Original Assignee
State Grid Co ltd Customer Service Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Co ltd Customer Service Center filed Critical State Grid Co ltd Customer Service Center
Priority claimed from CN202111179406.7A
Publication of CN113901218A
Legal status: Pending

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 40/216: Natural language analysis; parsing using statistical methods
    • G06F 40/242: Lexical tools; dictionaries
    • G06F 40/289: Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06Q 50/06: ICT specially adapted for business processes of specific sectors; energy or water supply


Abstract

The invention discloses an inspection business basic rule extraction method and device. A professional basic dictionary for the electric power inspection field and a word vector generation model are constructed; the characteristics of business rules in the inspection field and the differences in importance between words are fully considered; and the relation between entities is converted into an entity type, so that relations are extracted directly from the text as entities rather than being limited to known inspection business relations. The scheme effectively solves the problem of business rule extraction in the inspection field, fully considers text semantic information and the differences in importance between words, effectively avoids, through entity labeling of relations and a pattern-matching-based triple sequence extraction model for electric power inspection business rules, the limitation of traditional relation extraction methods that a classification model must enumerate relation types in advance, and improves the accuracy of entity relation extraction.

Description

Inspection business basic rule extraction method and device
Technical Field
The invention relates to the technical field of electric power business inspection, in particular to an inspection business basic rule extraction method and device.
Background
Business rules are descriptions of business definitions and constraints used to maintain business structure or to control and influence business behavior. A business rule can also be understood as a set of conditions together with the operations performed under those conditions: a set of precise, condensed statements used to describe, constrain, and control the structure, operation, and strategy of an enterprise, i.e., a piece of business logic in an application. The underlying principle is that a set of conditions is defined, and when those conditions are satisfied, one or more actions are triggered.
In the field of natural language processing, information extraction has long received attention. Information extraction mainly comprises three subtasks: entity extraction, relation extraction, and event extraction, among which relation extraction is a core task and an important link. The tasks involved in business rule extraction are entity extraction and relation extraction. The main objective of entity relation extraction is to identify and judge the specific relations existing between entity pairs in natural language text, which provides basic support for intelligent retrieval, semantic analysis, and the like, helps to improve search efficiency, and promotes the automatic construction of knowledge bases.
Unlike entity relation extraction in the general open domain, business rule extraction involves domain-specific knowledge. Text corpora describing business rules in the electric power inspection field rarely exhibit word-sense ambiguity and usually follow a specific "entity-relation-entity" sequential pattern, so entity relations are evident in the text. However, entity descriptions are complex, professional terms are numerous, and the relation types between entities are many and difficult to enumerate and organize; annotators must have domain expertise, manual labeling is difficult and costly, and the accuracy requirements on extraction results are strict. At present there is no published research on business rule extraction in the electric power inspection field.
The prior art is usually based on data from the open domain or from specific fields such as medicine; it lacks attention to knowledge in the electric power inspection business field and cannot be transplanted directly, and both a professional basic dictionary and research on entity relation extraction for the electric power inspection field are lacking. Existing methods consider only the semantic information of words and cannot distinguish the differences in importance between words; in reality, the relations between business rule entities in the power inspection field are varied and complex, and the prior art cannot exhaust the relation types.
Disclosure of Invention
The invention provides an inspection business basic rule extraction method and device, which convert the relation between entities into an entity type so that relations can be extracted directly from the text as entities, without being limited to known inspection business relations.
According to one aspect of the invention, an inspection business basic rule extraction method is provided, comprising the following steps:
performing text preprocessing on input data to obtain a segmented input corpus; the input data includes: common text corpora from the inspection field, labeled training text corpora, and text corpora containing the rules to be extracted;
constructing a word vector generation model for the electric power inspection field; obtaining a word vector for each word in the segmented input corpus by weighting based on Word2vec and an improved term frequency-inverse document frequency (TF-IDF) model, yielding a word vector matrix for each text;
constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrices of the labeled training text corpus as input data, and iterating the model continuously to complete model establishment and parameter tuning;
according to the trained rule extraction model, taking the word vector matrix of a text in the corpus of rules to be extracted as input data and outputting the entity relation label of each word in the text;
and obtaining, according to the entity relation labels, the set of inspection business basic rule triples corresponding to each text whose rules are to be extracted.
The text preprocessing comprises the following steps:
removing empty characters from the input data text;
replacing enumeration commas within sentences and sentence-final periods with a blank space;
keeping the punctuation marks commonly used in official documents, keeping digits, letters, and the letter case format unchanged, and removing other punctuation marks and special characters;
and segmenting the input text with a Chinese word segmentation tool, based on a general dictionary and an independently constructed professional basic dictionary for the inspection field, to obtain the segmented input corpus.
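The cleaning and segmentation steps above can be sketched as follows. The patent uses the jieba tokenizer loaded with a custom domain dictionary; the greedy longest-match segmenter here is only an illustrative stand-in, and the toy lexicon entries are assumptions.

```python
import re

def clean_text(text):
    """Rough approximation of the patent's cleaning rules."""
    text = re.sub(r"\s+", "", text)          # remove empty/blank characters
    text = re.sub(r"[、。；]", " ", text)      # in-sentence and final stops -> blank
    # keep CJK, digits, letters (case preserved) and common document punctuation
    text = re.sub(r"[^\w 《》（）():.\u4e00-\u9fff]", "", text)
    return text

def segment(text, lexicon, max_len=8):
    """Greedy longest-match segmentation against a domain lexicon."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == " ":
            i += 1
            continue
        for j in range(min(len(text), i + max_len), i, -1):
            # fall back to single characters when no lexicon entry matches
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"电能表", "现场检验"}  # hypothetical domain dictionary entries
print(segment(clean_text("电能表、现场检验。"), lexicon))  # -> ['电能表', '现场检验']
```

With the domain lexicon loaded, multi-character professional terms survive as single tokens instead of being split character by character, which is the effect Table 1 of the patent illustrates.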
Obtaining a word vector for each word in the segmented input corpus by weighting based on Word2vec and the improved term frequency-inverse document frequency method, yielding a word vector matrix for each text, comprises:
copying the segmented input corpus D and saving the copy as corpus D';
training a CBOW-based Word2Vec word embedding model on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, ..., e_N};
filtering corpus D with the established domain stop-word dictionary to remove the stop words in the corpus;
and training the TF-IDF model on the stop-word-filtered corpus D to obtain the tfidf value of every word in corpus D:

tfidf_{i,j} = tf_{i,j} \times idf_i

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} is the number of occurrences of word t_i in text d_j; \sum_k n_{k,j} is the total number of occurrences of all words in text d_j; |D| is the total number of texts in the corpus; and |\{j : t_i \in d_j\}| is the number of texts containing word t_i;
for a given text d_j in corpus D, initializing the tfidf value of each stop word to 0, and then adding 1 to the tfidf value of every word in the text to obtain the weights of all words of the current text d_j:

W_j = \{w_{1,j}, w_{2,j}, \ldots, w_{n,j}\}

where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
feeding the text d_j into the trained Word2Vec word embedding model, mapping each word to its word embedding representation e_i, i = 1, 2, 3, ..., n, and multiplying each word embedding by the weight of the corresponding word to obtain the final word vector matrix of the text:

V_j = \{v_{1,j}, v_{2,j}, \ldots, v_{n,j}\}

v_{i,j} = w_{i,j} \times e_i

where v_{i,j} is the final word vector of the i-th word of text d_j, w_{i,j} is the weight of the i-th word of text d_j, and e_i is the Word2Vec word embedding representation of the i-th word of text d_j;
and repeating the above steps for all texts in corpus D' to obtain the word vector matrix corresponding to each text.
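A minimal numeric sketch of the weighting scheme just described, with a stub embedding table standing in for the trained Word2Vec model (the real model would produce 128-dimensional vectors); the stop-word list and embedding values are assumptions for illustration only.

```python
import math

def improved_tfidf_weights(doc, corpus, stop_words):
    """w_{i,j} = tfidf_{i,j} + 1, with tfidf initialized to 0 for stop words."""
    n = len(doc)
    weights = []
    for word in doc:
        if word in stop_words:
            tfidf = 0.0                      # stop words: tfidf initialized to 0
        else:
            tf = doc.count(word) / n
            df = sum(1 for d in corpus if word in d)
            idf = math.log(len(corpus) / df)
            tfidf = tf * idf
        weights.append(tfidf + 1.0)          # +1 keeps stop-word vectors non-zero
    return weights

def word_vector_matrix(doc, corpus, stop_words, embeddings):
    """v_{i,j} = w_{i,j} * e_i for every word of the text."""
    w = improved_tfidf_weights(doc, corpus, stop_words)
    return [[wi * x for x in embeddings[word]] for wi, word in zip(w, doc)]

corpus = [["计量", "装置", "应", "检定"], ["装置", "应", "安装"]]
stop_words = {"应"}                            # assumed stop word
embeddings = {t: [1.0, 0.5] for d in corpus for t in d}  # stub 2-d embeddings
V = word_vector_matrix(corpus[0], corpus, stop_words, embeddings)
```

Note how the stop word "应" keeps a weight of exactly 1, so its vector is its raw embedding rather than a zero vector, which matches the motivation for the +1 offset.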
Constructing the rule extraction model based on the multi-head self-attention mechanism comprises:
taking the word vector matrices of the labeled training text corpus as input data, and encoding each sequence with its word vectors and position vectors; the position encoding is computed with sin and cos functions, the summed result is fed into a simplified Transformer model, and text features are extracted;
feeding the text feature extraction result into a conditional random field (CRF) classifier and outputting the probability score of the label sequence:

s(W, y) = \sum_{i=1}^{n-1} M_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

where W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is the predicted label sequence; M is the state transition matrix; M_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and s(W, y) is the probability score of the input text W being predicted as the label sequence y;
and training the model with the Adam optimization algorithm during back propagation, continuously updating the parameters, and finally obtaining the label prediction model; preferably, the negative log-likelihood function is used as the model loss function, specifically:

Loss = \log \sum_{\tilde{y} \in Y_W} e^{s(W, \tilde{y})} - s(W, y)

where s(W, \tilde{y}) is the probability score of the input text W being predicted as a candidate label sequence \tilde{y} among all possible label sequences Y_W, and s(W, y) is the probability score of the input text W being predicted as the true label sequence y.
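The CRF score and negative log-likelihood above can be made concrete with a tiny sketch. The transition and emission matrices below are made-up numbers, and the partition sum enumerates all label sequences by brute force (a real CRF implementation would use the forward algorithm instead).

```python
import itertools
import math

def score(trans, emis, y):
    """s(W, y) = sum_i M[y_i][y_{i+1}] + sum_i P[i][y_i]."""
    s = sum(trans[y[i]][y[i + 1]] for i in range(len(y) - 1))
    s += sum(emis[i][y[i]] for i in range(len(y)))
    return s

def nll_loss(trans, emis, gold):
    """Negative log-likelihood: log(sum over all y' of e^{s(W,y')}) - s(W, gold)."""
    n, k = len(emis), len(trans)
    log_z = math.log(sum(math.exp(score(trans, emis, y))
                         for y in itertools.product(range(k), repeat=n)))
    return log_z - score(trans, emis, gold)

trans = [[0.1, 0.4], [0.3, 0.2]]     # made-up 2-label transition scores
emis = [[1.0, 0.2], [0.1, 0.9]]      # made-up per-word label scores
loss = nll_loss(trans, emis, (0, 1))  # loss for the assumed gold sequence (0, 1)
```

Because log_z is a log-sum over every candidate sequence including the gold one, the loss is always non-negative; minimizing it pushes the gold sequence's score above the alternatives, which is exactly what the Adam updates do during training.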
Using the word vector matrices of the labeled training text corpus as input data and iterating the model continuously to complete model establishment and parameter tuning comprises:
according to the constructed business rule extraction model, taking the text corpus of rules to be extracted as input data and outputting the entity relation label of each word of each text in the corpus;
obtaining, one by one, the word vector matrix corresponding to every text in the corpus of rules to be extracted;
and feeding each word vector matrix, as input data, into the trained rule extraction model based on the multi-head self-attention mechanism for prediction, outputting the entity relation label of each word in the text to obtain the label sequence result of the current text.
The obtaining, according to the entity relation labels, of the set of inspection business basic rule triples corresponding to each text whose rules are to be extracted comprises:
establishing an inspection business rule triple extraction model based on regular expressions, extracting the relation triples that appear in the label sequence in entity-relation-entity order, outputting the set of inspection business basic rule triples corresponding to each text, and generating the inspection business basic rules.
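The pattern-matching step can be illustrated with a regular expression over the predicted label sequence. The BIO-style tag names and the sample sentence below are assumptions, not the patent's actual tag set.

```python
import re

def extract_triples(tokens, labels):
    """Collect spans from BIO-style labels, then match entity-relation-entity runs."""
    spans = []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            spans.append([lab[2:], tok])
        elif lab.startswith("I-") and spans and spans[-1][0] == lab[2:]:
            spans[-1][1] += tok
    # encode the span sequence as a string: E = entity span, R = relation span
    kinds = "".join("E" if k == "ENT" else "R" for k, _ in spans)
    # every consecutive entity-relation-entity pattern yields one rule triple
    return [(spans[m.start()][1], spans[m.start() + 1][1], spans[m.start() + 2][1])
            for m in re.finditer(r"(?=ERE)", kinds)]

tokens = ["计量", "装置", "应", "定期", "检定"]
labels = ["B-ENT", "I-ENT", "B-REL", "B-ENT", "I-ENT"]
print(extract_triples(tokens, labels))  # -> [('计量装置', '应', '定期检定')]
```

Treating the relation as just another labeled span is what lets this step avoid a fixed inventory of relation types: any span the model tags as a relation can land in the middle slot of a triple.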
According to another aspect of the present invention, an inspection business basic rule extraction device is provided, comprising:
a preprocessing unit, configured to perform text preprocessing on input data to obtain a segmented input corpus; the input data includes: common text corpora from the inspection field, labeled training text corpora, and text corpora containing the rules to be extracted;
a word vector generation unit, configured to construct a word vector generation model for the electric power inspection field, and to obtain a word vector for each word in the segmented input corpus by weighting based on Word2vec and an improved term frequency-inverse document frequency (TF-IDF) model, yielding a word vector matrix for each text;
a business rule learning unit, configured to construct a rule extraction model based on a multi-head self-attention mechanism, use the word vector matrices of the labeled training text corpus as input data, and iterate the model continuously to complete model establishment and parameter tuning;
an entity relation extraction unit, configured to take, according to the trained rule extraction model, the word vector matrix of a text in the corpus of rules to be extracted as input data and output the entity relation label of each word in the text;
and an inspection business output unit, configured to obtain, according to the entity relation labels, the set of inspection business basic rule triples corresponding to each text whose rules are to be extracted.
The preprocessing unit specifically comprises:
a character processing subunit, configured to remove empty characters from the input data text;
a punctuation processing subunit, configured to replace enumeration commas within sentences and sentence-final periods with a blank space;
a special character processing subunit, configured to keep the punctuation marks commonly used in official documents, keep digits, letters, and the letter case format unchanged, and remove other punctuation marks and special characters;
and a word segmentation subunit, configured to segment the input text with a Chinese word segmentation tool, based on a general dictionary and an independently constructed professional basic dictionary for the inspection field, to obtain the segmented input corpus.
The word vector generation unit specifically comprises:
a stop word processing subunit, configured to copy the segmented input corpus D and save the copy as corpus D', train a CBOW-based Word2Vec word embedding model on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, ..., e_N}, and filter corpus D with the established domain stop-word dictionary to remove the stop words in the corpus;
a model training subunit, configured to train the TF-IDF model on the stop-word-filtered corpus D to obtain the tfidf value of every word in corpus D:

tfidf_{i,j} = tf_{i,j} \times idf_i

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} is the number of occurrences of word t_i in text d_j; \sum_k n_{k,j} is the total number of occurrences of all words in text d_j; |D| is the total number of texts in the corpus; and |\{j : t_i \in d_j\}| is the number of texts containing word t_i;
for a given text d_j in corpus D, the tfidf value of each stop word is initialized to 0, and then 1 is added to the tfidf value of every word in the text to obtain the weights of all words of the current text d_j:

W_j = \{w_{1,j}, w_{2,j}, \ldots, w_{n,j}\}

where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
the text d_j is fed into the trained Word2Vec word embedding model, each word is mapped to its word embedding representation e_i, i = 1, 2, 3, ..., n, and each word embedding is multiplied by the weight of the corresponding word to obtain the final word vector matrix of the text:

V_j = \{v_{1,j}, v_{2,j}, \ldots, v_{n,j}\}

v_{i,j} = w_{i,j} \times e_i

where v_{i,j} is the final word vector of the i-th word of text d_j, w_{i,j} is the weight of the i-th word of text d_j, and e_i is the Word2Vec word embedding representation of the i-th word of text d_j;
and a word vector matrix processing subunit, configured to repeat the above steps for all texts in corpus D' to obtain the word vector matrix corresponding to each text.
The business rule learning unit specifically comprises:
a text feature extraction subunit, configured to take the word vector matrices of the labeled training text corpus as input data and encode each sequence with its word vectors and position vectors; the position encoding is computed with sin and cos functions, the summed result is fed into a simplified Transformer model, and text features are extracted;
a probability prediction subunit, configured to feed the text feature extraction result into the conditional random field (CRF) classifier and output the probability score of the label sequence:

s(W, y) = \sum_{i=1}^{n-1} M_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

where W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is the predicted label sequence; M is the state transition matrix; M_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and s(W, y) is the probability score of the input text W being predicted as the label sequence y;
and a probability obtaining subunit, configured to train the model with the Adam optimization algorithm during back propagation, continuously update the parameters, and finally obtain the label prediction model; preferably, the negative log-likelihood function is used as the model loss function, specifically:

Loss = \log \sum_{\tilde{y} \in Y_W} e^{s(W, \tilde{y})} - s(W, y)

where s(W, \tilde{y}) is the probability score of the input text W being predicted as a candidate label sequence \tilde{y} among all possible label sequences Y_W, and s(W, y) is the probability score of the input text W being predicted as the true label sequence y.
By adopting the above technical solution, the invention provides an inspection business basic rule extraction scheme: a professional basic dictionary for the electric power inspection field and a word vector generation model are constructed, the characteristics of business rules in the inspection field and the differences in importance between words are fully considered, and the relation between entities is converted into an entity type, so that relations are extracted directly from the text as entities without being limited to known inspection business relations.
The embodiment of the invention effectively solves the problem of business rule extraction in the inspection field, fully considers text semantic information and the differences in importance between words, effectively avoids, through entity labeling of relations and a pattern-matching-based triple sequence extraction model for electric power inspection business rules, the limitation of traditional relation extraction methods that a classification model must enumerate relation types in advance, and improves the accuracy of entity relation extraction.
According to this scheme, a professional basic dictionary for the electric power inspection field is established, and in the text cleaning process, cleaning rules tailored to the characteristics of text corpora in this specific field are adopted: the punctuation marks commonly used in official documents (such as 《 》 and parentheses) are retained, and digits, letters, and the letter case format are left unchanged. When computing the weights of the words in a text, the tfidf values of all non-stop words are calculated first, the tfidf value of each stop word is initialized to 0, and then 1 is added to the tfidf value of every word to obtain its weight; the weight of each word in the text is multiplied by its Word2Vec word embedding representation, thereby fusing semantic information with word importance. Compared with directly calculating and weighting the tfidf values of all words in the text, this approach accounts for the influence of stop words by reducing their tfidf values, i.e., their importance, to the minimum, while adding 1 to every tfidf value prevents the word vectors of stop words from collapsing to zero vectors after weighting.
The invention builds a word vector generation model for the electric power inspection field that integrates semantic information and word importance: given an input text, a word vector matrix can be mapped out automatically, a first for the electric power inspection field. Based on the observation that the texts describing electric power inspection business rules are relatively standardized in expression and strongly regular, yet have no fixed set of relations, the invention proposes entity labeling of relations, which effectively avoids the limitation of traditional relation extraction methods that a classification model must enumerate relation types in advance, and constructs a pattern-matching-based triple sequence extraction model for electric power inspection business rules, realizing effective extraction of business rule triples.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating basic rules extraction for inspection services according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an inspection business basic rule extraction method based on attention mechanism according to an embodiment of the present invention;
FIG. 3 is a model structure diagram of an inspection business basic rule extraction method based on attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an inspection business basic rule extracting device in an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
The attention-mechanism-based inspection business basic rule extraction method in the embodiment of the invention takes as input: common text corpora from the inspection field, labeled training text corpora, and text corpora containing the rules to be extracted; the output is the set of inspection business basic rule triples corresponding to each text whose rules are to be extracted.
In the embodiment of the invention, information extraction (Information Extraction) refers to the technology of automatically extracting specific information from large volumes of text data for loading into a database. An entity (Entity) is something that is distinguishable and exists independently; it need not be physically present, and abstract concepts and legal constructs are also commonly considered entities. Relations (Relations) are explicit or implicit semantic connections between entities. Word2Vec is a family of related models used to generate word vectors; after training is complete, each word can be mapped to a vector that captures word-to-word relationships. Word embedding is a general term for the language models and representation learning techniques in natural language processing (NLP) that map each word or phrase to a vector over the real numbers, conceptually embedding a high-dimensional space whose dimension is the vocabulary size into a continuous vector space of much lower dimension. The Transformer is a classic NLP model proposed by a Google team in 2017; it uses a self-attention mechanism rather than the sequential structure of an RNN, so the model can be trained in parallel and can capture global information.
FIG. 1 is a flowchart illustrating an inspection business basic rule extraction process according to an embodiment of the present invention. As shown in fig. 1, the inspection business basic rule extraction process includes the following steps:
step 101, performing text preprocessing on input data to obtain an input cooked corpus.
In an embodiment of the present invention, the input data includes: and checking common text corpora in the field, the marked training text corpora and the text corpora to be extracted according to the rule.
In the embodiment of the invention, a set of professional basic dictionaries in the electric power inspection field and text cleaning rules aiming at the characteristics of business text corpora in the electric power inspection field are constructed. All input data are cleaned and participled, and the detailed steps are as follows:
removing empty characters in the text;
replacing the dot number and the end dot number (including;) in the common sentence as a blank;
the punctuation marks (including < lambda > in </lambda >) (< lambda >) commonly used in official documents are reserved, the capital and lower case formats of numbers, letters and letters are kept unchanged, and other punctuation marks and special characters are removed;
based on a general dictionary and an independently constructed basic dictionary of the inspection field specialty, a jieba Chinese word segmentation tool is used for segmenting words of an input text to obtain a cooked corpus D, and word segmentation results before and after the reference of the basic dictionary of the inspection field specialty are shown as the following table 1:
TABLE 1 reference word segmentation result comparison before and after checking domain dictionary
Figure BDA0003293713830000111
102, constructing a power inspection field word vector generation model; and weighting based on Word2vec and the improved Word frequency-inverse document frequency TF-IDF model to obtain a Word vector of each Word in the input familiar material text, and obtaining a Word vector matrix of each text.
In the embodiment of the invention, a set of electric power inspection field word vector generation model integrating semantic information and importance degree is established, and a word vector matrix can be automatically mapped by inputting a text. And obtaining vectorization representation of each text through a word vector generating module.
In the embodiment of the invention, a Word vector generation model in the electric power inspection field is constructed, and a Word vector of each Word in a text is obtained based on Word2vec and improved TF-IDF (term frequency-inverse document frequency) weighting, so that a Word vector matrix of each text is obtained. The detailed steps are as follows:
The word-segmented corpus D is copied and saved as corpus D'.
Training a CBOW-based Word2Vec word embedding model using corpus D' yields word embedding representations E = {e_1, e_2, …, e_N} that carry semantic information; considering that the corpus D' is not large, the word embedding dimension is set to the commonly used 128.
And filtering stop words of the corpus D by using the established domain stop dictionary to remove the stop words in the corpus.
Training the TF-IDF model with the stop-word-filtered corpus D yields the tfidf value of every word in corpus D, calculated as follows:

tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D, tf_{i,j} denotes the frequency of word t_i in text d_j, idf_i is a measure of the general importance of word t_i, n_{i,j} denotes the number of occurrences of word t_i in text d_j, Σ_k n_{k,j} denotes the sum of occurrences of all words in text d_j, |D| denotes the total number of texts in the corpus, and |{ j : t_i ∈ d_j }| denotes the number of texts containing word t_i.
For a certain text d_j in the corpus D, it contains both non-stop words that have tfidf values and stop words that do not. The stop words in the text are initialized with tfidf value 0, reducing their influence on the text to a minimum; then 1 is added to the tfidf values of all words in the text, avoiding the situation where the weighted word vectors of stop words become zero vectors. This yields the weights of all words in the current text d_j:

W_j = { w_{1,j}, w_{2,j}, …, w_{n,j} }

where w_{i,j} = tfidf_{i,j} + 1 denotes the weight, i.e. the importance degree, of the i-th word. This step provides an improved tfidf calculation that effectively reduces the influence of stop words on the text and avoids zero vectors.
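A pure-Python sketch of the improved TF-IDF weighting described above; the two toy documents and the stop-word list are illustrative placeholders. Document frequencies are computed over the stop-word-filtered corpus, stop words get tfidf 0, and 1 is added to every value so no weight is zero:

```python
import math
from collections import Counter

def tfidf_weights(docs, stopwords):
    """docs: list of tokenized texts. Returns per-document word weights
    w = tfidf + 1, with stop words fixed at tfidf = 0 (hence weight 1)."""
    filtered = [[w for w in d if w not in stopwords] for d in docs]
    n_docs = len(docs)
    df = Counter()                          # document frequency per word
    for d in filtered:
        df.update(set(d))
    weights = []
    for d, fd in zip(docs, filtered):
        counts = Counter(fd)
        total = sum(counts.values()) or 1
        w = {}
        for word in d:
            if word in stopwords:
                tfidf = 0.0                 # minimize stop-word influence
            else:
                tf = counts[word] / total
                idf = math.log(n_docs / df[word])
                tfidf = tf * idf
            w[word] = tfidf + 1.0           # +1 avoids zero vectors later
        weights.append(w)
    return weights

docs = [["稽查", "规则", "的"], ["稽查", "流程", "的"]]
w = tfidf_weights(docs, stopwords={"的"})
print(w[0]["的"])  # stop word -> weight exactly 1.0
```

Note the contrast with a plain TF-IDF weighting: here a stop word's weight is pinned to 1 rather than computed, so it neither dominates nor vanishes after the multiplication in the next step.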
Text d_j is input into the trained Word2Vec word embedding model and mapped to the word embedding representations e_i (i = 1, 2, 3, …, n) of all its words; each word embedding is then multiplied by the corresponding weight w_{i,j} obtained above to produce the final word vector matrix of the text, calculated as follows:

V_j = { v_{1,j}, v_{2,j}, …, v_{n,j} }

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j, w_{i,j} denotes the weight of the i-th word in text d_j, and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j.
The above operations are repeated for all texts in the corpus D', thereby obtaining the word vector matrix corresponding to each text.
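The fusion step v_{i,j} = w_{i,j} × e_i is an element-wise scaling of each word's embedding by its weight; a minimal sketch, where the 4-dimensional embeddings and the weights are hypothetical stand-ins for the 128-dimensional Word2Vec vectors and the improved-TF-IDF weights:

```python
def word_vector_matrix(tokens, embeddings, weights):
    """v_{i,j} = w_{i,j} * e_i: scale each word's embedding by its weight."""
    return [[weights[t] * x for x in embeddings[t]] for t in tokens]

# hypothetical embeddings and weights for illustration only
embeddings = {"稽查": [0.1, 0.2, 0.3, 0.4], "规则": [0.5, 0.5, 0.0, 0.0]}
weights = {"稽查": 1.0, "规则": 1.5}
V = word_vector_matrix(["稽查", "规则"], embeddings, weights)
print(V[1])  # [0.75, 0.75, 0.0, 0.0]
```

Each row of V is one word's final vector, so stacking the rows gives the word vector matrix of the text.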
Step 103: constructing a rule extraction model based on a multi-head self-attention mechanism; using the word vector matrix of the labeled training text corpus as input data, the model is iterated continuously to complete model building and parameter tuning.
In the embodiment of the invention, according to the constructed business rule extraction model, the text corpus of the rule to be extracted is taken as input data, and the entity relationship label of each word of each text in the text corpus is output.
In the embodiment of the invention, the entity relationship label of each word in the text is obtained through the business rule extraction module. In the step, a rule extraction model based on a multi-head self-attention mechanism is constructed. In the construction process of the model, the word vector matrix of the labeled training text corpus is used as input data, and the model is continuously iterated to complete the establishment of the model and the parameter tuning. And then embedding the trained model into a business rule extraction module, taking a word vector matrix of one text in the text corpus of the rule to be extracted as input data, and outputting an entity relationship label of each word in the text after calculation.
In the embodiment of the invention, the detailed construction steps of the rule extraction model based on the multi-head self-attention mechanism are as follows:
The word vector matrix of the labeled training text corpus is taken as input data, and a position vector encoding is added to the word vector matrix to express word order; the position encoding uses the sin and cos calculation scheme. The summed result is fed into a simplified Transformer model to extract text features. Considering the limited size of the labeled sample set, and to reduce the risk of overfitting, the invention builds a simplified Transformer model consisting of 2 encoders and 2 decoders for feature extraction.
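The sin/cos position encoding mentioned above presumably follows the standard Transformer formulation PE(pos, 2k) = sin(pos / 10000^{2k/d}) and PE(pos, 2k+1) = cos(pos / 10000^{2k/d}); a small sketch (d_model = 8 is an arbitrary illustrative dimension, not the patent's):

```python
import math

def positional_encoding(seq_len, d_model):
    """Standard sinusoidal position encoding: sin on even dims, cos on odd."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(0, d_model, 2):
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# position 0 encodes as sin(0)=0 on even dims, cos(0)=1 on odd dims
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

The resulting matrix is added element-wise to the word vector matrix before it enters the Transformer encoder, giving the model access to word order.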
The output result is sent to a conditional random field (CRF) classifier, which outputs the probability score of the tag sequence, calculated as follows:

s(W, y) = Σ_{i=0..n} M_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text, y = (y_1, y_2, …, y_n) is the predicted tag sequence, M is the state transition matrix, M_{y_i, y_{i+1}} denotes the probability of transferring from label y_i to label y_{i+1}, P_{i, y_i} denotes the probability that the i-th word is labeled y_i, and s(W, y) is the probability score for the input text W being predicted as tag sequence y.
In the back-propagation process, the model is trained with the Adam optimization algorithm and the parameters are continuously updated, finally obtaining the label prediction model. A negative log-likelihood function is used as the model loss function:

Loss = −log( exp(s(W, y)) / Σ_{ỹ} exp(s(W, ỹ)) )

where s(W, ỹ) denotes the probability score for the input text W being predicted as a candidate tag sequence ỹ, and s(W, y) is the probability score for the input text W being predicted as the true tag sequence y.
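The CRF score and its negative log-likelihood can be sketched numerically; M and P below are tiny hypothetical matrices (2 labels, 2 words), and the partition sum enumerates all candidate sequences explicitly rather than using the forward algorithm a real CRF would use:

```python
import math
from itertools import product

def score(M, P, y):
    """s(W, y) = sum_i M[y_i][y_{i+1}] + sum_i P[i][y_i] (no start/stop states)."""
    trans = sum(M[y[i]][y[i + 1]] for i in range(len(y) - 1))
    emit = sum(P[i][y[i]] for i in range(len(y)))
    return trans + emit

def nll_loss(M, P, y_true, n_labels):
    """-log( exp(s(W, y)) / sum over all candidate sequences of exp(s(W, y~)) )"""
    n = len(y_true)
    log_z = math.log(sum(math.exp(score(M, P, y))
                         for y in product(range(n_labels), repeat=n)))
    return log_z - score(M, P, y_true)

M = [[0.2, 0.1], [0.0, 0.3]]   # transition scores between 2 labels
P = [[1.0, 0.0], [0.2, 0.8]]   # per-word emission scores for 2 words
loss = nll_loss(M, P, y_true=(0, 1), n_labels=2)
print(round(loss, 4))
```

Minimizing this loss (here via Adam in the patent's setup) pushes the score of the true tag sequence up relative to all competing sequences.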
Step 104: according to the trained rule extraction model, taking the word vector matrix of a text in the text corpus of rules to be extracted as input data, and outputting the entity relation label of each word in the text.
In the embodiment of the invention, after the rule extraction model based on the multi-head self-attention mechanism is constructed, a complete business rule extraction module is formed; feeding input data through this module yields the text's label sequence result. The detailed steps are as follows:
acquiring word vector matrixes corresponding to all texts in the text corpus of the rule to be extracted one by one;
and (3) taking the word vector matrix as input data, sending the input data into a trained rule extraction model based on a multi-head self-attention mechanism for prediction, and outputting an entity relation label of each word in the text, thereby obtaining a label sequence result of the current text.
Step 105: obtaining, according to the entity relation labels, the inspection business basic rule triple set corresponding to each rule text to be extracted.
In the embodiment of the invention, the inspection business basic rules are output. Considering that the expression of inspection business rule description texts is relatively uniform and the corresponding tag sequences exhibit a certain regularity, an inspection business rule triple extraction model based on rule expressions is established: the relation triples in the text are extracted from the tag sequence output in step 104 in the order "entity-relation-entity", and finally the inspection business basic rule triple set corresponding to each rule text to be extracted is output, so as to generate the inspection business basic rules.
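Extraction of "entity-relation-entity" triples from a tag sequence can be sketched as pattern matching over label spans; the tag scheme below (E for entity tokens, R for relation tokens, O for other) is an illustrative assumption, not the patent's exact label set:

```python
def extract_triples(tokens, tags):
    """Merge adjacent same-label tokens into spans, then emit an
    (entity, relation, entity) triple for every E-R-E span pattern."""
    spans = []                                    # (label, merged text)
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            continue
        if spans and spans[-1][0] == tag:
            spans[-1] = (tag, spans[-1][1] + tok)  # extend current span
        else:
            spans.append((tag, tok))
    triples = []
    for a, b, c in zip(spans, spans[1:], spans[2:]):
        if (a[0], b[0], c[0]) == ("E", "R", "E"):
            triples.append((a[1], b[1], c[1]))
    return triples

# hypothetical tokens and labels for illustration only
tokens = ["稽查", "人员", "应", "核对", "用户", "档案"]
tags = ["E", "E", "O", "R", "E", "E"]
print(extract_triples(tokens, tags))
# [('稽查人员', '核对', '用户档案')]
```

Because the rule texts are uniformly phrased, this kind of fixed-order span pattern suffices, and no pre-enumerated relation type inventory is needed.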
Referring to fig. 2, a flow chart of an inspection business basic rule extraction method based on attention mechanism is shown. FIG. 3 is a model structure diagram of an inspection business basic rule extraction method based on attention mechanism.
The embodiment of the invention constructs a set of professional basic dictionaries for the electric power inspection field and, during text cleaning, adopts cleaning rules tailored to the characteristics and regularities of text corpora in this specific field: punctuation marks commonly used in official documents (such as brackets, parentheses, and hyphens) are retained, and numbers, letters, and the upper/lower case of letters are kept unchanged.
When calculating the weights of words in a text, the tfidf values of all non-stop words are computed first and the tfidf value of every stop word is initialized to 0; then 1 is added to the tfidf values of all words in the text to obtain the word weights, and the weight of each word is multiplied by its Word2Vec word embedding representation, thereby fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words in the text, this method accounts for the influence of stop words by reducing their tfidf values, i.e. their importance, to the minimum, and the 1 added to all tfidf values avoids the situation where the weighted word vectors of stop words become zero vectors.
The embodiment of the invention builds a word vector generation model for the electric power inspection field that integrates semantic information and importance degree: given an input text, it automatically maps out the text's word vector matrix, a first in the electric power inspection field.
Given that power inspection business rule description texts are expressed in a relatively standard way, exhibit strong regularity, and involve relations that are not fixed in advance, the embodiment of the invention proposes a method of entity-labeling the relations, effectively avoiding the limitation of traditional relation extraction methods in which a classification model must enumerate relation types in advance, and constructs a pattern-matching-based triple sequence extraction model for power inspection business rules, realizing effective extraction of business rule triples.
In order to implement the above process, the technical solution of the present invention further provides an inspection business basic rule extracting device, as shown in fig. 4, the inspection business basic rule extracting device includes:
the preprocessing unit 21 is configured to perform text preprocessing on input data to obtain the input processed corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
the word vector generating unit 22 is configured to construct a word vector generation model for the electric power inspection field, and to obtain, by weighting based on Word2vec and an improved term frequency-inverse document frequency (TF-IDF) model, the word vector of each word in the processed corpus text and thereby the word vector matrix of each text;
the business rule learning unit 23 is configured to construct a rule extraction model based on a multi-head self-attention mechanism, and continuously iterate the model by using the word vector matrix of the labeled training text corpus as input data to complete the establishment and parameter tuning of the model;
the entity relationship extraction unit 24 is configured to extract a model according to the trained rule, take a word vector matrix of one text in a text corpus of the rule to be extracted as input data, and output an entity relationship label of each word in the text;
and the inspection service output unit 25 is configured to obtain an inspection service basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship tag.
The preprocessing unit 21 specifically includes:
the character processing subunit is used for removing empty characters in the input data text;
the punctuation processing subunit is configured to replace the enumeration comma within sentences and sentence-ending punctuation with a blank space;
the special character processing subunit is configured to retain the punctuation marks commonly used in official documents, keep numbers, letters, and the upper/lower case of letters unchanged, and remove other punctuation marks and special characters;
the word segmentation processing subunit is configured to segment the input text with a Chinese word segmentation tool based on a general dictionary and an independently constructed professional basic dictionary for the inspection field, obtaining the input processed corpus.
The word vector generating unit 22 specifically includes:
the stop word processing subunit is configured to copy and save the input processed corpus D as corpus D', train a CBOW-based Word2Vec word embedding model using corpus D' to obtain word embedding representations E = {e_1, e_2, …, e_N} with semantic information, and filter stop words from the corpus D using the established domain stop-word dictionary to remove the stop words in the corpus;
the model training subunit is configured to train the TF-IDF model using the stop-word-filtered corpus D to obtain the tfidf values of all words in corpus D:

tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the sum of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{ j : t_i ∈ d_j }| denotes the number of texts containing word t_i;

for a certain text d_j in the corpus D, the tfidf values of its stop words are initialized to 0; 1 is added to the tfidf values of all words in the text to obtain the weights of all words in the current text d_j:

W_j = { w_{1,j}, w_{2,j}, …, w_{n,j} }

where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;

text d_j is input into the trained Word2Vec word embedding model and mapped to the word embedding representations e_i (i = 1, 2, 3, …, n) of all its words; each word embedding is multiplied by the weight w_{i,j} corresponding to each word to obtain the final word vector matrix of the text:

V_j = { v_{1,j}, v_{2,j}, …, v_{n,j} }

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word; and e_i denotes the Word2Vec word embedding representation of the i-th word;
and the word vector matrix processing subunit is used for repeatedly executing the steps on all the texts in the corpus D' to obtain a word vector matrix corresponding to each text.
The business rule learning unit 23 specifically includes:
the text feature extraction subunit, configured to take the word vector matrix of the labeled training text corpus as input data and add a position vector encoding to the word vector matrix to express word order; the position encoding uses the sin and cos calculation scheme, and the summed result is fed into a simplified Transformer model to extract text features;
the probability prediction subunit, configured to send the text feature extraction result to the conditional random field (CRF) classifier and output the probability score of the tag sequence:

s(W, y) = Σ_{i=0..n} M_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted tag sequence; M is the state transition matrix; M_{y_i, y_{i+1}} denotes the probability of transferring from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and s(W, y) is the probability score for the input text W being predicted as tag sequence y;

the probability obtaining subunit, configured to train the model with the Adam optimization algorithm in the back-propagation process and continuously update the parameters, finally obtaining the label prediction model; preferably, a negative log-likelihood function is used as the model loss function, specifically as follows:

Loss = −log( exp(s(W, y)) / Σ_{ỹ} exp(s(W, ỹ)) )

where s(W, ỹ) denotes the probability score for the input text W being predicted as a candidate tag sequence ỹ; and s(W, y) is the probability score for the input text W being predicted as the true tag sequence y.
In summary, the technical solution of the present invention provides an inspection business basic rule extraction solution that constructs a set of professional basic dictionaries for the electric power inspection field and a word vector generation model, fully considers the characteristics of business rules in the inspection field and the importance differences between words, and, by converting the relations between entities into an entity type, extracts relations directly from the text as entities without being limited to known inspection business relations.
The embodiment of the invention effectively solves the problem of business rule extraction in the inspection field and fully considers text semantic information and the importance differences between words; through entity labeling of relations and a pattern-matching-based triple sequence extraction model for power inspection business rules, it avoids the limitation of traditional relation extraction methods in which a classification model must enumerate relation types in advance, and improves the accuracy of entity relation extraction.
According to the scheme, a set of professional basic dictionaries for the electric power inspection field is established, and during text cleaning, cleaning rules tailored to the characteristics and regularities of text corpora in this specific field are adopted: punctuation marks commonly used in official documents (such as brackets, parentheses, and hyphens) are retained, and numbers, letters, and the upper/lower case of letters are kept unchanged. When calculating the weights of words in a text, the tfidf values of all non-stop words are computed first and the tfidf value of every stop word is initialized to 0; then 1 is added to the tfidf values of all words in the text to obtain the word weights, and the weight of each word is multiplied by its Word2Vec word embedding representation, thereby fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words, this accounts for the influence of stop words by reducing their importance to the minimum, while the 1 added to all tfidf values avoids zero word vectors for stop words after weighting.
The invention builds a word vector generation model for the electric power inspection field that integrates semantic information and importance degree: given an input text, it automatically maps out a word vector matrix, a first in the electric power inspection field. Given that power inspection business rule description texts are expressed in a relatively standard way, exhibit strong regularity, and involve relations that are not fixed in advance, the invention proposes entity-labeling the relations, effectively avoiding the limitation of traditional relation extraction methods in which a classification model must enumerate relation types in advance, and constructs a pattern-matching-based triple sequence extraction model for power inspection business rules, realizing effective extraction of business rule triples.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. An inspection business basic rule extraction method is characterized by comprising the following steps:
performing text preprocessing on input data to obtain an input processed corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
constructing a power inspection field word vector generation model; obtaining, by weighting based on Word2vec and an improved term frequency-inverse document frequency TF-IDF model, a word vector of each word in the processed corpus text, and obtaining a word vector matrix of each text;
constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and continuously iterating the model to complete the establishment and parameter tuning of the model;
according to the trained rule extraction model, taking a word vector matrix of a text in the text corpus of the rule to be extracted as input data, and outputting an entity relation label of each word in the text;
and obtaining an inspection business basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label.
2. The method of claim 1, wherein the text preprocessing comprises:
removing empty characters in the input data text;
replacing the enumeration comma within sentences and sentence-ending punctuation with a blank space;
retaining the punctuation marks commonly used in official documents, keeping numbers, letters, and the upper/lower case of letters unchanged, and removing other punctuation marks and special characters;
segmenting the input text with a Chinese word segmentation tool based on a general dictionary and an independently constructed professional basic dictionary for the inspection field, obtaining the input processed corpus.
3. The method as claimed in claim 1, wherein obtaining the word vector of each word in the processed corpus text by weighting based on Word2vec and the improved term frequency-inverse document frequency method comprises:
copying and saving the input processed corpus D as a corpus D';
training a CBOW-based Word2Vec word embedding model using corpus D' to obtain word embedding representations E = {e_1, e_2, …, e_N} with semantic information;
Filtering stop words of the corpus D by using the established domain stop dictionary to remove the stop words in the corpus;
training the TF-IDF model with the stop-word-filtered corpus D to obtain the tfidf values of all words in corpus D:

tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )

wherein tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the sum of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{ j : t_i ∈ d_j }| denotes the number of texts containing word t_i;

for a certain text d_j in the corpus D, initializing the tfidf values of the stop words therein to 0; adding 1 to the tfidf values of all words in the text to obtain the weights of all words in the current text d_j:

W_j = { w_{1,j}, w_{2,j}, …, w_{n,j} }

wherein w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;

inputting the text d_j into the trained Word2Vec word embedding model, and mapping to obtain the word embedding representations e_i (i = 1, 2, 3, …, n) of all words of the text; multiplying each word embedding representation by the weight w_{i,j} corresponding to each word to obtain the final word vector matrix of the text:

V_j = { v_{1,j}, v_{2,j}, …, v_{n,j} }

v_{i,j} = w_{i,j} × e_i

wherein v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word; and e_i denotes the Word2Vec word embedding representation of the i-th word;
and repeating the steps on all texts in the corpus D' to obtain a word vector matrix corresponding to each text.
4. The method as claimed in claim 1, wherein the rule extraction model based on the multi-head self-attention mechanism is constructed as follows:
taking the word vector matrix of the labeled training text corpus as input data, and adding a position vector encoding to the word vector matrix to express word order; the position encoding uses the sin and cos calculation scheme, and the summed result is fed into a simplified Transformer model to extract text features;
sending the text feature extraction result to a conditional random field CRF classifier, and outputting the probability score of the label sequence:
s(W, y) = Σ_{i=0..n} M_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}

wherein W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted tag sequence; M is the state transition matrix; M_{y_i, y_{i+1}} denotes the probability of transferring from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and s(W, y) is the probability score of the input text W being predicted as tag sequence y;

training the model with an Adam optimization algorithm in the back-propagation process and continuously updating the parameters to finally obtain the label prediction model; preferably, a negative log-likelihood function is used as the model loss function, specifically as follows:

Loss = −log( exp(s(W, y)) / Σ_{ỹ} exp(s(W, ỹ)) )

wherein s(W, ỹ) denotes the probability score of the input text W being predicted as a candidate tag sequence ỹ; and s(W, y) is the probability score of the input text W being predicted as the true tag sequence y.
5. The method as claimed in claim 1, wherein said continuously iterating the model to complete the model building and parameter tuning using the word vector matrix of the labeled training text corpus as input data comprises:
according to the constructed business rule extraction model, taking a text corpus of rules to be extracted as input data, and outputting an entity relationship label of each word of each text in the text corpus;
acquiring all texts in the text corpus of the rule to be extracted one by one to obtain the corresponding word vector matrix;
and the word vector matrix is used as input data and is sent into a trained rule extraction model based on a multi-head self-attention mechanism for prediction, and the entity relation label of each word in the text is output to obtain the label sequence result of the current text.
6. The method as claimed in claim 1, wherein obtaining the inspection business basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship labels comprises:
establishing an inspection service rule triple extraction model based on a rule expression, extracting relation triples in the text in an entity-relation-entity sequence in the tag sequence, outputting an inspection service basic rule triple set corresponding to the text of each rule to be extracted, and generating an inspection service basic rule.
7. An inspection business basic rule extracting device, comprising:
the preprocessing unit is used for performing text preprocessing on input data to obtain the input processed corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
the word vector generating unit is used for constructing a word vector generation model in the electric power inspection field; weighting based on Word2vec and an improved term frequency-inverse document frequency TF-IDF model to obtain a word vector of each word in the processed corpus text, and obtaining a word vector matrix of each text;
the business rule learning unit is used for constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and continuously iterating the model to complete the establishment and parameter tuning of the model;
the entity relationship extraction unit is used for extracting a model according to the trained rule, taking a word vector matrix of one text in the text corpus of the rule to be extracted as input data, and outputting an entity relationship label of each word in the text;
and the inspection service output unit is used for obtaining an inspection service basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label.
8. The inspection business basic rule extraction device of claim 7, wherein the preprocessing unit specifically comprises:
the character processing subunit is used for removing empty characters from the input data text;
the stop-mark processing subunit is used for replacing stop marks within sentences and at sentence ends with blanks;
the special character processing subunit is used for keeping punctuation marks common in official documents, keeping numbers and letters with letter case unchanged, and removing other punctuation marks and special characters;
and the word segmentation processing subunit is used for segmenting the input text with a Chinese word segmentation tool, based on a general dictionary and an independently constructed professional basic dictionary for the inspection field, to obtain the processed input corpus.
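The character-level cleaning steps of claim 8 can be sketched as below. The set of "common official-document punctuation marks" is an assumption (the claim does not enumerate it), and the word segmentation step is omitted since it depends on an external Chinese segmentation tool.

```python
import re

# Punctuation assumed "common in official documents" (illustrative set).
KEEP = "，、；：？！（）《》"

def clean_text(text: str) -> str:
    # 1) Remove empty characters (ASCII/full-width spaces, zero-width chars).
    text = re.sub(r"[\s\u3000\u200b]+", "", text)
    # 2) Replace stop marks (in-sentence and sentence-final periods) with blanks.
    text = re.sub(r"[。.]", " ", text)
    # 3) Keep digits, letters (case preserved), CJK characters and KEEP marks;
    #    drop all other punctuation and special characters.
    #    (str.isalnum() is True for CJK characters in Python.)
    return "".join(ch for ch in text
                   if ch.isalnum() or ch == " " or ch in KEEP)
```

After this cleaning, the result would be handed to the segmentation subunit loaded with the general and inspection-domain dictionaries.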
9. The inspection business basic rule extraction device of claim 7, wherein the word vector generating unit specifically comprises:
a stop word processing subunit, configured to copy and store a part of the input processed corpus D as corpus D', train a Word2Vec word embedding model based on CBOW using corpus D', and obtain word embedding representations E = {e_1, e_2, …, e_N} carrying semantic information; stop words are then filtered from corpus D using the established domain stop-word dictionary to remove the stop words from the corpus;
and the model training subunit is used for training the TF-IDF model on the stop-word-filtered corpus D to obtain the tfidf value of each word in corpus D:
tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

wherein tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the total number of texts containing word t_i;
for a given text d_j in corpus D, the tfidf value of each stop word is initialized to 0; 1 is then added to the tfidf value of every word in the text to obtain the weights of all words of the current text d_j:

W_j = {w_{1,j}, w_{2,j}, …, w_{n,j}}

wherein w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the ith word;
the text d_j is input into the trained Word2Vec word embedding model and mapped to obtain the word embedding representations e_i (i = 1, 2, 3, …, n) of all words of the text; each word embedding representation is multiplied by the weight w_{i,j} of the corresponding word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, …, v_{n,j}}

v_{i,j} = w_{i,j} × e_i

wherein v_{i,j} denotes the final word vector of the ith word of text d_j; w_{i,j} denotes the weight of the ith word of text d_j; and e_i denotes the Word2Vec word embedding representation of the ith word of text d_j;
and the word vector matrix processing subunit is used for repeating the above steps for all texts in corpus D' to obtain the word vector matrix corresponding to each text.
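The TF-IDF weighting of claim 9 (stop words forced to tfidf = 0, then w = tfidf + 1, then v_i = w_i × e_i) can be sketched on a toy corpus. The embedding table passed to `weighted_vectors` is assumed given; in the claim it would come from the trained Word2Vec model.

```python
import math

def tfidf_weights(docs, stop_words=()):
    """Per-document word weights w = tfidf + 1, with stop words forced to
    tfidf = 0 (weight 1). `docs` is a list of token lists."""
    n_docs = len(docs)
    df = {}                                  # document frequency per word
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    weights = []
    for d in docs:
        total = len(d)
        w = {}
        for t in set(d):
            if t in stop_words:
                tfidf = 0.0                  # stop word: tfidf initialized to 0
            else:
                tf = d.count(t) / total      # n_{i,j} / sum_k n_{k,j}
                idf = math.log(n_docs / df[t])
                tfidf = tf * idf
            w[t] = tfidf + 1.0               # w_{i,j} = tfidf_{i,j} + 1
        weights.append(w)
    return weights

def weighted_vectors(doc, w, embed):
    # Final word vectors: v_i = w_i * e_i (embeddings `embed` assumed given).
    return [[w[t] * x for x in embed[t]] for t in doc]
```

Adding 1 to every tfidf value keeps stop words in the matrix with their raw embedding (weight exactly 1) while up-weighting discriminative words.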
10. The inspection business basic rule extraction device of claim 7, wherein the business rule learning unit specifically comprises:
the text feature extraction subunit is used for taking the word vector matrices of the labeled training text corpora as input data, and encoding the sequence representation from the word vectors and position vectors; the position encoding adopts sin and cos calculations, and the summed result is fed into a simplified Transformer model to extract text features;
and the probability prediction subunit is used for feeding the text feature extraction result into a conditional random field (CRF) classifier and outputting the probability score of the tag sequence:

s(W, y) = Σ_{i=0}^{n} M_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

wherein W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted tag sequence; M is the state transition matrix; M_{y_i, y_{i+1}} denotes the probability of transferring from tag y_i to tag y_{i+1}; P_{i, y_i} denotes the probability that the ith word is labeled with tag y_i; and s(W, y) is the probability score of the input text W being predicted as the tag sequence y;
the probability obtaining subunit is used for training the model with the Adam optimization algorithm during back-propagation, continuously updating the parameters, and finally obtaining the tag prediction model; preferably, a negative log-likelihood function is used as the model loss function, specifically as follows:

Loss = −log( e^{s(W, y)} / Σ_{ỹ ∈ Y_W} e^{s(W, ỹ)} )

wherein Σ_{ỹ ∈ Y_W} e^{s(W, ỹ)} sums the exponentiated probability scores of the input text W being predicted as each candidate tag sequence ỹ over the set Y_W of all possible tag sequences; and s(W, y) is the probability score of the input text W being predicted as the true tag sequence y.
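The CRF score and negative log-likelihood of claim 10 can be sketched numerically. This toy version omits the start/stop boundary transition terms and computes the partition sum by brute-force enumeration; a real CRF uses the forward algorithm for the denominator.

```python
import math
from itertools import product

def score(M, P, y):
    """s(W, y): transition scores M[y_i][y_{i+1}] plus emission scores
    P[i][y_i] (start/stop boundary terms omitted in this sketch)."""
    s = sum(M[y[i]][y[i + 1]] for i in range(len(y) - 1))
    s += sum(P[i][y[i]] for i in range(len(y)))
    return s

def nll_loss(M, P, y_true, n_labels):
    """Negative log-likelihood -log( e^{s(W,y)} / sum_y' e^{s(W,y')} ),
    enumerating every label sequence y' (toy-sized inputs only)."""
    n = len(P)
    log_z = math.log(sum(math.exp(score(M, P, y))
                         for y in product(range(n_labels), repeat=n)))
    return log_z - score(M, P, y_true)
```

Minimizing this loss raises the score of the true tag sequence relative to all competing sequences, which is exactly what the Adam updates in the claim optimize.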
CN202111179406.7A 2021-10-08 2021-10-08 Inspection business basic rule extraction method and device Pending CN113901218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111179406.7A CN113901218A (en) 2021-10-08 2021-10-08 Inspection business basic rule extraction method and device

Publications (1)

Publication Number Publication Date
CN113901218A true CN113901218A (en) 2022-01-07

Family

ID=79190945



Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648316A (en) * 2022-05-18 2022-06-21 国网浙江省电力有限公司 Digital processing method and system based on inspection tag library
CN114648316B (en) * 2022-05-18 2022-08-23 国网浙江省电力有限公司 Digital processing method and system based on inspection tag library
CN117909492A (en) * 2024-03-19 2024-04-19 国网山东省电力公司信息通信公司 Method, system, equipment and medium for extracting unstructured information of power grid


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination