CN113901218A - Inspection business basic rule extraction method and device - Google Patents
- Publication number
- CN113901218A (application number CN202111179406.7A)
- Authority
- CN
- China
- Prior art keywords: text, word, corpus, model, rule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/00 Information retrieval; database structures therefor; file system structures therefor
- G06F16/30 Information retrieval of unstructured textual data
- G06F16/35 Clustering; classification
- G06F40/00 Handling natural language data
- G06F40/20 Natural language analysis
- G06F40/205 Parsing
- G06F40/216 Parsing using statistical methods
- G06F40/237 Lexical tools
- G06F40/242 Dictionaries
- G06F40/279 Recognition of textual entities
- G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
- G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06 Energy or water supply
Abstract
The invention discloses a method and a device for extracting basic inspection business rules. The invention constructs a professional base dictionary for the electric power inspection field together with a word vector generation model, fully considers the characteristics of business rules in the inspection field and the differences in importance between words, and converts the relationship between entities into an entity type, so that relations are extracted directly from the text as entities rather than being limited to known inspection business relations. The invention effectively solves the problem of extracting business rules in the inspection field and fully considers text semantic information and the differences in importance between words. By tagging relations as entities and extracting triples with a pattern-matching-based triple sequence extraction model for electric power inspection business rules, the invention effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and improves the accuracy of entity relation extraction.
Description
Technical Field
The invention relates to the technical field of electric power business inspection, in particular to an inspection business basic rule extraction method and device.
Background
Business rules are descriptions of business definitions and constraints used to maintain business structure or to control and influence business behavior. A business rule can also be understood as a set of conditions together with the operations performed under those conditions: a set of precise, condensed statements that describe, constrain, and control the structure, operation, and strategy of an enterprise, that is, a piece of business logic within an application. The underlying principle is simple: a set of conditions is defined, and when those conditions are satisfied, one or more actions are triggered.
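The condition-action principle described above can be sketched as a small data structure; the class name and the invoice example below are illustrative, not part of the invention.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class BusinessRule:
    """A business rule: a set of conditions that, when all satisfied, triggers actions."""
    name: str
    conditions: List[Callable[[Dict[str, Any]], bool]]
    actions: List[Callable[[Dict[str, Any]], None]]

    def evaluate(self, facts: Dict[str, Any]) -> bool:
        """Fire the actions if every condition holds for the given facts."""
        if all(cond(facts) for cond in self.conditions):
            for action in self.actions:
                action(facts)
            return True
        return False

# Illustrative audit rule: flag invoices above a threshold.
log: List[str] = []
rule = BusinessRule(
    name="flag_large_invoice",
    conditions=[lambda f: f.get("amount", 0) > 10000],
    actions=[lambda f: log.append("flagged invoice " + f["id"])],
)
rule.evaluate({"id": "INV-1", "amount": 25000})
rule.evaluate({"id": "INV-2", "amount": 500})
```

Only the first invoice satisfies the condition set, so only it triggers the action.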
In the field of natural language processing, information extraction has long attracted attention. It comprises three main subtasks: entity extraction, relation extraction, and event extraction, of which relation extraction is the core task and a key link. Business rule extraction involves both entity and relation extraction. The main objective of entity relation extraction is to identify and classify the specific relations that exist between entity pairs in natural language text; it provides basic support for intelligent retrieval, semantic analysis, and related tasks, helps improve search efficiency, and promotes the automatic construction of knowledge bases.
Unlike entity relation extraction in the general open domain, business rule extraction involves domain-specific knowledge. The text corpora describing business rules in the electric power inspection field are free of word-sense ambiguity and usually follow a fixed "entity-relation-entity" ordering, so entity relations are evident in the text. However, entity descriptions are complex, technical terms are numerous, and the relation types between entities are many and hard to enumerate and organize; annotators must have domain expertise, manual labeling is difficult and costly, and the accuracy requirements on the extraction results are strict. At present there is no published research on business rule extraction in the electric power inspection field.
The prior art is usually based on data from the open domain or from specific fields such as medicine; it pays no attention to knowledge of the electric power inspection business, cannot be transplanted directly, and lacks both a professional base dictionary and research on entity relation extraction for the electric power inspection field. Existing methods consider only the semantic information of words and cannot distinguish differences in word importance, while in practice the relations between business rule entities in the power inspection field are varied and complex, so the prior art cannot enumerate all relation types.
Disclosure of Invention
The invention provides a method and a device for extracting basic inspection business rules, which convert the relationship between entities into an entity type so that relations can be extracted directly from the text as entities, without being limited to known inspection business relations.
According to one aspect of the invention, an inspection business basic rule extraction method is provided, which comprises the following steps:
performing text preprocessing on the input data to obtain a processed input corpus, wherein the input data includes: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted;
constructing a word vector generation model for the electric power inspection field; obtaining the word vector of each word in the processed input text by weighting Word2vec embeddings with an improved term frequency-inverse document frequency (TF-IDF) model, and obtaining a word vector matrix for each text;
constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and iterating the model continuously to complete its construction and parameter tuning;
according to the trained rule extraction model, taking the word vector matrix of a text from the to-be-extracted corpus as input data and outputting an entity relation label for each word in the text;
and obtaining, according to the entity relation labels, the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted.
The text preprocessing comprises the following steps:
removing empty characters from the input data text;
replacing mid-sentence stops and sentence-final stops with blanks;
keeping the punctuation marks commonly used in official documents, keeping digits, letters, and letter case unchanged, and removing all other punctuation marks and special characters;
and segmenting the input text with a Chinese word segmentation tool, based on a general dictionary and an independently constructed professional base dictionary for the inspection field, to obtain the processed input corpus.
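A minimal sketch of these cleaning steps follows; the exact kept-character set is an assumption based on the rules above (the patent's full punctuation whitelist is not reproduced here), and the jieba calls shown in comments are the usual API for loading a custom dictionary.

```python
import re

def clean_text(text: str) -> str:
    """Cleaning rules sketched from the steps above: replace mid-sentence
    and sentence-final stops with blanks, keep common official-document
    punctuation plus digits and letters (case preserved), strip everything
    else, and collapse the resulting whitespace."""
    text = re.sub(r"[。；;]", " ", text)                          # stops -> blank
    text = re.sub(r"[^\w\s()（）《》\[\]【】:：,，.\-]", "", text)  # assumed keep-list
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace

# Segmentation would then use jieba with the domain dictionary, e.g.
#   jieba.load_userdict("inspection_domain_dict.txt")
#   tokens = jieba.lcut(clean_text(raw_text))
```

Note that Python's `\w` matches Unicode word characters, so Chinese characters survive the keep-list untouched.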
Obtaining the word vector of each word in the processed input corpus by weighting Word2vec embeddings with the improved term frequency-inverse document frequency method, and obtaining the word vector matrix of each text, comprises the following steps:
copying the processed corpus D and saving the copy as corpus D';
training a CBOW-based Word2Vec word embedding model on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, ..., e_N};
filtering the stop words out of corpus D using the established domain stop-word dictionary;
training the TF-IDF model on the stop-word-filtered corpus D to obtain the tfidf value of every word in D:

tfidf_{i,j} = tf_{i,j} × idf_i, with tf_{i,j} = n_{i,j} / Σ_k n_{k,j} and idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} is the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} is the total number of word occurrences in text d_j; |D| is the total number of texts in the corpus; and |{j : t_i ∈ d_j}| is the number of texts containing word t_i;
for a given text d_j in corpus D, initializing the tfidf value of each stop word to 0, then adding 1 to the tfidf value of every word in the text to obtain the weights of all words in the current text d_j, where w_{i,j} = tfidf_{i,j} + 1 is the weight of the i-th word;
inputting the text d_j into the trained Word2Vec word embedding model, mapping out the embeddings e_i, i = 1, 2, ..., n, of all words of the text, and multiplying each embedding by the weight w_{i,j} of the corresponding word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, ..., v_{n,j}}, v_{i,j} = w_{i,j} × e_i

where v_{i,j} is the final word vector of the i-th word of text d_j; w_{i,j} is the weight of the i-th word of text d_j; and e_i is the Word2Vec word embedding of the i-th word of text d_j;
and repeating the above steps for every text in corpus D' to obtain the word vector matrix corresponding to each text.
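The weighting scheme above, tfidf plus one with stop words zeroed first, can be sketched as follows; the corpus, stop-word list, and toy embedding table are illustrative, and a real implementation would take the embeddings from the trained Word2Vec model rather than a hand-written dict.

```python
import math
from typing import Dict, List, Sequence, Set

def tfidf_weights(docs: Sequence[List[str]], stopwords: Set[str]) -> List[Dict[str, float]]:
    """Per-text word weights w_{i,j} = tfidf_{i,j} + 1, with each stop word's
    tfidf initialised to 0, as in the scheme above (natural log assumed for idf)."""
    n_docs = len(docs)
    df: Dict[str, int] = {}                        # document frequency of non-stop words
    for doc in docs:
        for t in set(doc):
            if t not in stopwords:
                df[t] = df.get(t, 0) + 1
    weights: List[Dict[str, float]] = []
    for doc in docs:
        total = sum(1 for t in doc if t not in stopwords)  # sum_k n_{k,j} after filtering
        w: Dict[str, float] = {}
        for t in set(doc):
            if t in stopwords:
                tfidf = 0.0                        # stop words: importance zeroed
            else:
                tf = doc.count(t) / total
                tfidf = tf * math.log(n_docs / df[t])
            w[t] = tfidf + 1.0                     # +1 keeps stop-word vectors non-zero
        weights.append(w)
    return weights

docs = [["breaker", "the", "overload"], ["breaker", "inspect"]]
W = tfidf_weights(docs, stopwords={"the"})

# Weighting a toy 2-d embedding table (stand-in for Word2Vec output).
emb = {"breaker": [0.2, 0.4], "the": [0.1, 0.1], "overload": [0.5, 0.0]}
vec = {t: [w * x for x in emb[t]] for t, w in W[0].items()}
```

The stop word "the" ends up with weight exactly 1, so its vector is preserved rather than zeroed, which is the point of the +1 correction.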
Constructing the rule extraction model based on a multi-head self-attention mechanism comprises the following steps:
taking the word vector matrix of the labeled training text corpus as input data, and encoding the sequence by adding word vectors and position vectors; the position encoding uses the sin and cos scheme, and the summed result is fed into a simplified Transformer model to extract text features;
sending the text feature extraction result to a conditional random field (CRF) classifier and outputting the probability score of the label sequence:

s(W, y) = Σ_i M(y_i, y_{i+1}) + Σ_i P(i, y_i)

where W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is a predicted tag sequence; M is the state transition matrix; M(y_i, y_{i+1}) denotes the probability of transitioning from label y_i to label y_{i+1}; P(i, y_i) denotes the probability that the i-th word is assigned label y_i; and s(W, y) is the probability score of the input text W being predicted as the label sequence y;
and training the model with the Adam optimization algorithm during back propagation, continuously updating the parameters, to finally obtain the label prediction model; preferably, the negative log-likelihood is used as the model loss function, specifically:

Loss = -log( exp(s(W, y)) / Σ_{y'} exp(s(W, y')) )

where s(W, y') is the probability score of the input text W being predicted as a candidate label sequence y', and s(W, y) is the probability score of the input text W being predicted as the true tag sequence y.
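The CRF scoring formula, together with the Viterbi decoding that recovers the best-scoring label sequence from it, can be sketched in plain Python; the emission and transition values below are illustrative, and in the patent's model the emission scores would come from the Transformer features.

```python
from typing import List, Sequence, Tuple

def crf_score(emissions: Sequence[Sequence[float]],
              transitions: Sequence[Sequence[float]],
              tags: Sequence[int]) -> float:
    """s(W, y): sum of per-position emission scores P[i][y_i]
    plus transition scores M[y_i][y_{i+1}], as in the formula above."""
    s = sum(emissions[i][t] for i, t in enumerate(tags))
    s += sum(transitions[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    return s

def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Best tag sequence under the same score (standard Viterbi decoding)."""
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back: List[List[int]] = []
    for i in range(1, n):
        new_score, bp = [], []
        for t in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + transitions[p][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + emissions[i][t])
            bp.append(best_prev)
        score = new_score
        back.append(bp)
    best_last = max(range(k), key=lambda t: score[t])
    tags = [best_last]
    for bp in reversed(back):
        tags.append(bp[tags[-1]])
    tags.reverse()
    return tags, score[best_last]
```

By construction the score Viterbi returns equals `crf_score` evaluated on the decoded path.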
Using the word vector matrix of the labeled training text corpus as input data and continuously iterating the model to complete its construction and parameter tuning, and then applying the model, comprises:
according to the constructed business rule extraction model, taking the to-be-extracted text corpus as input data and outputting the entity relation label of each word of each text in the corpus;
taking the texts in the to-be-extracted corpus one by one and obtaining the corresponding word vector matrices;
and feeding each word vector matrix, as input data, into the trained multi-head self-attention rule extraction model for prediction, outputting the entity relation label of each word in the text to obtain the label sequence result of the current text.
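The sin and cos position encoding mentioned in the model-construction step is, by this description, the standard sinusoidal scheme; a sketch under that assumption:

```python
import math
from typing import List

def positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """Sinusoidal position encoding: sin on even dimensions, cos on odd,
    with the usual 10000^(i/d_model) wavelength schedule."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# The encoded input is word_vector + position_vector, position by position,
# before being fed to the simplified Transformer.
```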
Obtaining, according to the entity relation labels, the set of basic inspection business rule triples corresponding to each text comprises:
building a regular-expression-based triple extraction model for inspection business rules, extracting the relation triples that appear in the label sequence in entity-relation-entity order, outputting the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted, and generating the basic inspection business rules.
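The entity-relation-entity pattern match over the label sequence can be sketched as follows; the BIO-style tag names ENT and REL are illustrative, not the patent's actual tag set, and Chinese tokens would be concatenated without the space used here for readability.

```python
from typing import List, Sequence, Tuple

def extract_triples(tokens: Sequence[str], labels: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Collect BIO-labelled chunks, then emit a triple wherever the chunk
    sequence matches the entity-relation-entity pattern."""
    spans: List[Tuple[str, str]] = []          # (chunk type, chunk text)
    cur_type, cur_toks = None, []

    def flush():
        nonlocal cur_type, cur_toks
        if cur_type:
            spans.append((cur_type, " ".join(cur_toks)))
        cur_type, cur_toks = None, []

    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            flush()
            cur_type, cur_toks = lab[2:], [tok]
        elif lab.startswith("I-") and cur_type == lab[2:]:
            cur_toks.append(tok)
        else:
            flush()
    flush()

    triples = []
    for i in range(len(spans) - 2):
        if (spans[i][0], spans[i + 1][0], spans[i + 2][0]) == ("ENT", "REL", "ENT"):
            triples.append((spans[i][1], spans[i + 1][1], spans[i + 2][1]))
    return triples
```

Because the patent's texts follow the entity-relation-entity ordering, a sliding window over the chunk sequence is enough to recover the rule triples.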
According to another aspect of the present invention, there is provided an inspection business basic rule extracting device, including:
the preprocessing unit, configured to perform text preprocessing on input data to obtain a processed input corpus, wherein the input data includes: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted;
the word vector generation unit, configured to construct a word vector generation model for the electric power inspection field, obtain the word vector of each word in the processed input text by weighting Word2vec embeddings with an improved term frequency-inverse document frequency (TF-IDF) model, and obtain a word vector matrix for each text;
the business rule learning unit, configured to construct a rule extraction model based on a multi-head self-attention mechanism, use the word vector matrix of the labeled training text corpus as input data, and iterate the model continuously to complete its construction and parameter tuning;
the entity relation extraction unit, configured to take, according to the trained rule extraction model, the word vector matrix of a text from the to-be-extracted corpus as input data and output an entity relation label for each word in the text;
and the inspection business output unit, configured to obtain, according to the entity relation labels, the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted.
The preprocessing unit specifically includes:
the character processing subunit, configured to remove empty characters from the input data text;
the punctuation processing subunit, configured to replace mid-sentence stops and sentence-final stops with blanks;
the special character processing subunit, configured to keep the punctuation marks commonly used in official documents, keep digits, letters, and letter case unchanged, and remove all other punctuation marks and special characters;
and the word segmentation processing subunit, configured to segment the input text with a Chinese word segmentation tool, based on a general dictionary and an independently constructed professional base dictionary for the inspection field, to obtain the processed input corpus.
The word vector generation unit specifically includes:
the stop word processing subunit, configured to copy the processed corpus D and save the copy as corpus D', train a CBOW-based Word2Vec word embedding model on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, ..., e_N}, and filter the stop words out of corpus D using the established domain stop-word dictionary;
the model training subunit, configured to train the TF-IDF model on the stop-word-filtered corpus D to obtain the tfidf value of every word in D:

tfidf_{i,j} = tf_{i,j} × idf_i

where tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} is the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} is the total number of word occurrences in text d_j; |D| is the total number of texts in the corpus; and |{j : t_i ∈ d_j}| is the number of texts containing word t_i;
for a given text d_j in corpus D, the tfidf value of each stop word is initialized to 0; 1 is then added to the tfidf value of every word in the text, giving the weights of all words in the current text d_j, where w_{i,j} = tfidf_{i,j} + 1 is the weight of the i-th word;
the text d_j is input into the trained Word2Vec word embedding model, which maps out the embeddings e_i, i = 1, 2, ..., n, of all words of the text; each embedding is multiplied by the weight w_{i,j} of the corresponding word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, ..., v_{n,j}}, v_{i,j} = w_{i,j} × e_i

where v_{i,j} is the final word vector of the i-th word of text d_j; w_{i,j} is the weight of the i-th word of text d_j; and e_i is the Word2Vec word embedding of the i-th word of text d_j;
and the word vector matrix processing subunit, configured to repeat the above steps for every text in corpus D' to obtain the word vector matrix corresponding to each text.
The business rule learning unit specifically includes:
the text feature extraction subunit, configured to take the word vector matrix of the labeled training text corpus as input data and encode the sequence by adding word vectors and position vectors; the position encoding uses the sin and cos scheme, and the summed result is fed into a simplified Transformer model to extract text features;
the probability prediction subunit, configured to send the text feature extraction result to the conditional random field (CRF) classifier and output the probability score of the label sequence:

s(W, y) = Σ_i M(y_i, y_{i+1}) + Σ_i P(i, y_i)

where W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is a predicted tag sequence; M is the state transition matrix; M(y_i, y_{i+1}) denotes the probability of transitioning from label y_i to label y_{i+1}; P(i, y_i) denotes the probability that the i-th word is assigned label y_i; and s(W, y) is the probability score of the input text W being predicted as the label sequence y;
and the probability obtaining subunit, configured to train the model with the Adam optimization algorithm during back propagation, continuously updating the parameters, to finally obtain the label prediction model; preferably, the negative log-likelihood is used as the model loss function, specifically:

Loss = -log( exp(s(W, y)) / Σ_{y'} exp(s(W, y')) )

where s(W, y') is the probability score of the input text W being predicted as a candidate label sequence y', and s(W, y) is the probability score of the input text W being predicted as the true tag sequence y.
By adopting the above technical scheme, the invention provides a basic inspection business rule extraction scheme that constructs a professional base dictionary and a word vector generation model for the electric power inspection field, fully considers the characteristics of business rules in the inspection field and the differences in importance between words, and converts the relationship between entities into an entity type, so that relations are extracted directly from the text as entities rather than being limited to known inspection business relations.
The embodiment of the invention effectively solves the problem of extracting business rules in the inspection field and fully considers text semantic information and the differences in importance between words. By tagging relations as entities and using a pattern-matching-based triple sequence extraction model for electric power inspection business rules, it effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and improves the accuracy of entity relation extraction.
The scheme establishes a professional base dictionary for the electric power inspection field. During text cleaning, cleaning rules tailored to the characteristics of text corpora in this specific field are applied: the punctuation marks commonly used in official documents are kept, and digits, letters, and letter case are left unchanged. When computing word weights, the tfidf values of all non-stop words are calculated first and the tfidf value of every stop word is initialized to 0; then 1 is added to the tfidf value of every word in the text to obtain the word weights, and the weight of each word is multiplied by its Word2Vec word embedding, fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words, this approach accounts for the influence of stop words by reducing their tfidf values, i.e. their importance, to the minimum, while adding 1 to every tfidf value prevents the weighted word vectors of stop words from collapsing to zero vectors.
The invention builds a word vector generation model for the electric power inspection field that fuses semantic information with word importance: given an input text, it automatically maps out the word vector matrix, a first in the electric power inspection field. Given that the texts describing electric power inspection business rules are relatively standardized, strongly regular, and contain non-fixed relations, the invention proposes tagging relations as entities, which effectively avoids the limitation of traditional relation extraction methods, in which the classification model must enumerate relation types in advance, and constructs a pattern-matching-based triple sequence extraction model for electric power inspection business rules, realizing the effective extraction of business rule triples.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating basic rules extraction for inspection services according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an inspection business basic rule extraction method based on attention mechanism according to an embodiment of the present invention;
FIG. 3 is a model structure diagram of an inspection business basic rule extraction method based on attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an inspection business basic rule extracting device in an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings; it should be understood that they are described here only to illustrate and explain the invention, not to limit it.
The attention-mechanism-based inspection business basic rule extraction method of the embodiment of the invention uses the following input data: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted; the output result is the set of basic inspection business rule triples corresponding to each text from which rules are to be extracted.
In the embodiment of the invention, information extraction is primarily a technique for automatically extracting specific information from large amounts of text data, e.g. for populating a database. An entity is something distinguishable that exists independently, though not necessarily physically: abstractions and legal constructs are also commonly considered entities. Relationships are explicit or implicit semantic connections between entities. Word2Vec is a family of related models used to generate word vectors; after training, each word can be mapped to a vector that captures word-to-word relationships. Word embedding is a collective term for the language models and representation learning techniques in natural language processing (NLP) that embed a high-dimensional space, with one dimension per word, into a continuous vector space of much lower dimension, mapping each word or phrase to a real-valued vector. The Transformer is a classic NLP model proposed by a Google team in 2017; it uses a self-attention mechanism instead of the sequential structure of an RNN, so it can be trained in parallel and can capture global information.
FIG. 1 is a flowchart illustrating an inspection business basic rule extraction process according to an embodiment of the present invention. As shown in fig. 1, the inspection business basic rule extraction process includes the following steps:
In an embodiment of the invention, the input data includes: common text corpora from the inspection field, labeled training text corpora, and the text corpora from which rules are to be extracted.
In the embodiment of the invention, a set of professional basic dictionaries in the electric power inspection field and text cleaning rules aiming at the characteristics of business text corpora in the electric power inspection field are constructed. All input data are cleaned and participled, and the detailed steps are as follows:
removing empty characters in the text;
replacing the dot number and the end dot number (including;) in the common sentence as a blank;
the punctuation marks (including < lambda > in </lambda >) (< lambda >) commonly used in official documents are reserved, the capital and lower case formats of numbers, letters and letters are kept unchanged, and other punctuation marks and special characters are removed;
based on a general dictionary and an independently constructed basic dictionary of the inspection field specialty, a jieba Chinese word segmentation tool is used for segmenting words of an input text to obtain a cooked corpus D, and word segmentation results before and after the reference of the basic dictionary of the inspection field specialty are shown as the following table 1:
Table 1: comparison of word segmentation results before and after referencing the inspection-field dictionary
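The cleaning and segmentation steps above can be sketched as follows. The punctuation whitelist, the toy dictionary entries, and the fallback segmenter are illustrative assumptions; the patent itself uses jieba loaded with the custom inspection-domain dictionary, and forward maximum matching here only stands in to show how a domain dictionary changes the result.

```python
import re

# Punctuation assumed to be retained for official documents (an assumption;
# the patent's exact whitelist is not reproduced in the text).
KEEP = "《》()()-"

def clean_text(text):
    text = re.sub(r"\s+", "", text)          # remove empty characters
    text = re.sub(r"[、,,;;。]", " ", text)    # sentence punctuation -> blank
    allowed = re.compile(r"[\u4e00-\u9fffA-Za-z0-9 " + re.escape(KEEP) + r"]")
    return "".join(ch for ch in text if allowed.match(ch))

def fmm_segment(text, dictionary, max_len=6):
    """Forward maximum matching: greedily take the longest dictionary word,
    falling back to single characters (a stand-in for jieba's behaviour)."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in dictionary:
                words.append(text[i:i + l])
                i += l
                break
    return words

generic = {"变压"}                    # hypothetical generic dictionary entries
domain = generic | {"变压器套管"}      # domain dictionary adds the full term
```

With only the generic dictionary, "变压器套管" (transformer bushing) fragments into pieces; adding the domain term keeps it whole, which is the effect Table 1 illustrates.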
102: constructing a power inspection field word vector generation model, and obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved term frequency-inverse document frequency (TF-IDF) model, thereby obtaining a word vector matrix for each text.
In the embodiment of the invention, a word vector generation model for the electric power inspection field that fuses semantic information and importance degree is established; given an input text, its word vector matrix can be mapped automatically. A vectorized representation of each text is obtained through the word vector generation module.
In the embodiment of the invention, a word vector generation model for the electric power inspection field is constructed, and the word vector of each word in a text is obtained by weighting Word2Vec embeddings with an improved TF-IDF (term frequency-inverse document frequency) model, yielding a word vector matrix for each text. The detailed steps are as follows:
The word-segmented corpus D is copied and saved as corpus D'.
A CBOW-based Word2Vec word embedding model is trained on corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, …, e_N}. Considering that the corpus D' is not large, the word embedding dimension is set to the commonly used 128.
Stop words are filtered from corpus D using the established domain stop-word dictionary, removing them from the corpus.
The TF-IDF model is trained on the stop-word-filtered corpus D to obtain the tfidf value of every word in corpus D, calculated as:

tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i for text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the total number of texts containing word t_i.
For a given text d_j in corpus D, its words comprise non-stop words, which have tfidf values, and stop words, which do not. The stop words in the text are initialized with a tfidf value of 0, reducing their influence on the text to a minimum; then 1 is added to the tfidf value of every word in the text, avoiding the situation where the word vectors of stop words become zero vectors after weighting. This yields the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, …, w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight, i.e., the importance degree, of the i-th word. This step provides an improved tfidf calculation that effectively reduces the influence of stop words on the text and avoids the zero-vector situation.
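The improved weighting can be sketched as below: stop words get tfidf = 0, then every word's weight is tfidf + 1, so no word vector collapses to zero after weighting. The toy corpus and the stop-word list are assumptions for illustration.

```python
import math

def improved_weights(doc, corpus, stopwords):
    """Return w_{i,j} = tfidf_{i,j} + 1 for each token of doc (a list of words)."""
    n = len(doc)
    weights = []
    for t in doc:
        if t in stopwords:
            tfidf = 0.0                            # stop word: minimal influence
        else:
            tf = doc.count(t) / n                  # tf = n_{i,j} / sum_k n_{k,j}
            df = sum(1 for d in corpus if t in d)  # |{j : t_i in d_j}|
            tfidf = tf * math.log(len(corpus) / df)
        weights.append(tfidf + 1.0)                # the +1 offset avoids zeros
    return weights

corpus = [["巡检", "的", "设备"], ["设备", "正常"]]
w = improved_weights(["巡检", "的", "设备"], corpus, stopwords={"的"})
```

Here the stop word "的" and the corpus-wide word "设备" both receive the floor weight 1.0, while the distinctive word "巡检" receives a weight above 1.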
Text d_j is fed into the trained Word2Vec word embedding model, mapping each of its words to a word embedding representation e_i, i = 1, 2, 3, …, n. Each word embedding is then multiplied by the corresponding weight w_{i,j} obtained in step s205 to produce the final word vector matrix of the text, calculated as:
V_j = {v_{1,j}, v_{2,j}, …, v_{n,j}}

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j, w_{i,j} denotes the weight of the i-th word in text d_j, and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j.
The above operations are repeated for all texts in corpus D', yielding the word vector matrix corresponding to each text.
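The weighting step v_{i,j} = w_{i,j} × e_i amounts to scaling each word's embedding by its weight; the scaled vectors stacked together form the text's word vector matrix V_j. A minimal sketch, with toy 3-dimensional embeddings assumed for illustration:

```python
def word_vector_matrix(embeddings, weights):
    """Scale each word embedding e_i by its weight w_{i,j} to get v_{i,j}."""
    return [[w * x for x in e] for e, w in zip(embeddings, weights)]

# Toy values: two words, 3-dimensional embeddings (illustrative assumptions).
E = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]   # e_1, e_2
W = [2.0, 1.0]                            # w_{1,j}, w_{2,j}
V = word_vector_matrix(E, W)              # the text's word vector matrix V_j
```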
103: constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrices of the labeled training text corpora as input data, and iterating the model until model building and parameter tuning are complete.
In the embodiment of the invention, according to the constructed business rule extraction model, the text corpus of the rule to be extracted is taken as input data, and the entity relationship label of each word of each text in the text corpus is output.
In the embodiment of the invention, the entity relationship label of each word in the text is obtained through the business rule extraction module. In this step, a rule extraction model based on a multi-head self-attention mechanism is constructed. During model construction, the word vector matrices of the labeled training text corpora are used as input data, and the model is iterated until model building and parameter tuning are complete. The trained model is then embedded in the business rule extraction module: the word vector matrix of one text from the corpus of rules to be extracted is taken as input data, and after calculation the entity relationship label of each word in that text is output.
In the embodiment of the invention, the detailed construction steps of the rule extraction model based on the multi-head self-attention mechanism are as follows:
The word vector matrix of the labeled training text corpus is taken as input data, and a position vector encoding is added to it to express word order; the position encoding uses the sin/cos scheme. The summed result is fed into a simplified Transformer model to extract text features. Considering the limited amount of labeled data, and to reduce the risk of overfitting, the invention builds a simplified Transformer model consisting of 2 encoders and 2 decoders for feature extraction.
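The sin/cos position encoding named above can be sketched as follows; the formulation assumed here follows the original Transformer paper (even dimensions use sin, odd dimensions use cos), which the patent does not spell out.

```python
import math

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(word_vectors, pe):
    """Element-wise sum of word vectors and position codes: the Transformer input."""
    return [[v + p for v, p in zip(row, code)]
            for row, code in zip(word_vectors, pe)]
```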
The output is sent to a conditional random field (CRF) classifier, which outputs the probability score of the label sequence, calculated as:

S(W, y) = Σ_i M_{y_i, y_{i+1}} + Σ_i P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted label sequence; M is the state transition matrix, with M_{y_i, y_{i+1}} denoting the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and S(W, y) is the probability score of input text W being predicted as label sequence y.
In the back-propagation process, the Adam optimization algorithm is used to train the model and continuously update the parameters, finally yielding the label prediction model. A negative log-likelihood function is used as the model loss function:

Loss = log Σ_{ỹ} exp(S(W, ỹ)) − S(W, y)

where S(W, ỹ) is the probability score of input text W being predicted as a candidate label sequence ỹ, and S(W, y) is the probability score of input text W being predicted as the true label sequence y.
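The negative log-likelihood above can be sketched by computing the partition term with brute-force enumeration of all label sequences; this is only viable for tiny examples, and a real CRF implementation would use the forward algorithm instead (an assumption of this sketch, not the patent's stated method).

```python
import itertools
import math

def crf_nll(emissions, transitions, gold):
    """-log p(y | W) = log sum_y~ exp(S(W, y~)) - S(W, y), by enumeration."""
    def score(tags):
        s = sum(emissions[i][t] for i, t in enumerate(tags))
        return s + sum(transitions[tags[i]][tags[i + 1]]
                       for i in range(len(tags) - 1))

    num_tags = len(transitions)
    n = len(emissions)
    log_z = math.log(sum(math.exp(score(seq))
                         for seq in itertools.product(range(num_tags), repeat=n)))
    return log_z - score(gold)

# Toy scores (illustrative values); the gold sequence is the highest-scoring one,
# so the loss is small but strictly positive.
loss = crf_nll([[1.0, 0.0], [0.0, 2.0]], [[0.5, 1.5], [0.0, 0.0]], (0, 1))
```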
104: according to the trained rule extraction model, taking the word vector matrix of one text from the corpus of rules to be extracted as input data, and outputting the entity relationship label of each word in the text.
In the embodiment of the invention, after the rule extraction model based on a multi-head self-attention mechanism is constructed, it forms a complete business rule extraction module; passing input data through this module yields the text's label sequence result. The detailed steps are as follows:
acquiring word vector matrixes corresponding to all texts in the text corpus of the rule to be extracted one by one;
taking the word vector matrix as input data, sending it into the trained rule extraction model based on a multi-head self-attention mechanism for prediction, and outputting the entity relationship label of each word in the text, thereby obtaining the label sequence result of the current text.
105: obtaining the inspection business basic rule triple set corresponding to each rule text to be extracted according to the entity relationship labels.
In the embodiment of the invention, the basic rules of the inspection business are output. Considering that descriptions of inspection business rules are expressed in a uniform manner, so that the corresponding tag sequences exhibit clear regularity, an inspection business rule triple extraction model based on rule expressions is established. The relation triples in the text are extracted from the tag sequence output in step s312 in "entity-relation-entity" order, and finally the inspection business basic rule triple set corresponding to each rule text to be extracted is output, generating the inspection business basic rules.
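The entity-relation-entity scan can be sketched as below. The label names 'E' (entity), 'R' (relation), and 'O' (other) are illustrative assumptions, since the patent does not name its tag set.

```python
def extract_triples(words, labels):
    """Scan the predicted label sequence and emit (entity, relation, entity)
    triples in order of appearance, as the rule-expression model describes."""
    # Keep only entity and relation tokens, preserving order.
    items = [(l, w) for w, l in zip(words, labels) if l in ("E", "R")]
    triples, i = [], 0
    while i <= len(items) - 3:
        (k1, w1), (k2, w2), (k3, w3) = items[i:i + 3]
        if (k1, k2, k3) == ("E", "R", "E"):
            triples.append((w1, w2, w3))
            i += 3            # consume the matched entity-relation-entity window
        else:
            i += 1
    return triples

# Toy tagged sentence (words and tags are illustrative assumptions).
triples = extract_triples(["变压器", "巡视", "套管"], ["E", "R", "E"])
```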
Referring to fig. 2, a flow chart of an inspection business basic rule extraction method based on attention mechanism is shown. FIG. 3 is a model structure diagram of an inspection business basic rule extraction method based on attention mechanism.
The embodiment of the invention constructs a professional basic dictionary for the electric power inspection field and, in the text cleaning process, adopts cleaning rules tailored to the characteristics and patterns of text corpora in this specific field: punctuation marks common in official documents (e.g. 《》, parentheses, and hyphens) are retained, and numbers, letters, and letter case are kept unchanged.
When calculating the weights of words in a text, the tfidf values of all non-stop words are computed first, the tfidf value of each stop word is initialized to 0, and then 1 is added to the tfidf value of every word in the text to obtain its weight; the weight of every word is then multiplied by its Word2Vec word embedding representation, fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words, this approach accounts for the influence of stop words by reducing their tfidf value, i.e., their importance, to the minimum, and the added 1 prevents the word vectors of stop words from becoming zero vectors after weighting.
The embodiment of the invention builds a word vector generation model for the electric power inspection field that fuses semantic information and importance degree; given an input text, its word vector matrix can be mapped automatically, a first in the electric power inspection field.
Given that descriptions of electric power inspection business rules are expressed in a relatively standard way, exhibit strong regularity, and involve relations that are not fixed in advance, the embodiment of the invention proposes labeling relations as entities. This effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and a pattern-matching-based electric power inspection business rule triple sequence extraction model is constructed, realizing effective extraction of business rule triples.
In order to implement the above process, the technical solution of the present invention further provides an inspection business basic rule extracting device, as shown in fig. 4, the inspection business basic rule extracting device includes:
the preprocessing unit 21 is configured to perform text preprocessing on input data to obtain an input corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
the word vector generation unit 22 is used for constructing a word vector generation model for the electric power inspection field, obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved term frequency-inverse document frequency TF-IDF model, and obtaining a word vector matrix for each text;
the business rule learning unit 23 is configured to construct a rule extraction model based on a multi-head self-attention mechanism, and continuously iterate the model by using the word vector matrix of the labeled training text corpus as input data to complete the establishment and parameter tuning of the model;
the entity relationship extraction unit 24 is configured to extract a model according to the trained rule, take a word vector matrix of one text in a text corpus of the rule to be extracted as input data, and output an entity relationship label of each word in the text;
and the inspection service output unit 25 is configured to obtain an inspection service basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship tag.
The preprocessing unit 21 specifically includes:
the character processing subunit is used for removing empty characters in the input data text;
the punctuation processing subunit is used for replacing enumeration commas within sentences and sentence-final periods with blanks;
the special character processing subunit is used for keeping the common punctuation marks of the official documents, keeping the numbers, the letters and the capital and lower case formats of the letters unchanged and removing other punctuation marks and special characters;
and the word segmentation processing subunit is used for segmenting the input text by using a Chinese word segmentation tool based on the general dictionary and the independently constructed professional basic dictionary in the inspection field to obtain the input cooked linguistic data.
The word vector generating unit 22 specifically includes:
a stop word processing subunit, configured to copy and save a copy of the input cooked corpus D as corpus D', train a CBOW-based Word2Vec word embedding model using corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, …, e_N}, and filter stop words from corpus D using the established domain stop-word dictionary to remove them from the corpus;
and the model training subunit is used for training the TF-IDF model by using the corpus D which is subjected to stop word filtering to obtain tfidf values of all words in the corpus D:
tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i for text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the total number of texts containing word t_i;

for a given text d_j in corpus D, the tfidf values of its stop words are initialized to 0; 1 is added to the tfidf values of all words in the text to obtain the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, …, w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
feeding text d_j into the trained Word2Vec word embedding model to map each of its words to a word embedding representation e_i, i = 1, 2, 3, …, n, and multiplying each word embedding by the weight w_{i,j} corresponding to that word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, …, v_{n,j}}

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word in text d_j; and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j;
and the word vector matrix processing subunit is used for repeatedly executing the steps on all the texts in the corpus D' to obtain a word vector matrix corresponding to each text.
The business rule learning unit 23 specifically includes:
the text feature extraction subunit is used for taking the word vector matrix of the labeled training text corpus as input data and adding position vector encodings to the word vectors to express word order; the position encoding uses the sin/cos scheme, and the summed result is fed into a simplified Transformer model to extract text features;
the probability prediction subunit is used for sending the text feature extraction result to the conditional random field CRF classifier and outputting the probability score of the label sequence:

S(W, y) = Σ_i M_{y_i, y_{i+1}} + Σ_i P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted label sequence; M is the state transition matrix, with M_{y_i, y_{i+1}} denoting the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and S(W, y) is the probability score of input text W being predicted as label sequence y;
the probability obtaining subunit is used for training the model with the Adam optimization algorithm during back propagation, continuously updating the parameters, and finally obtaining the label prediction model; preferably, a negative log-likelihood function is used as the model loss function, specifically:

Loss = log Σ_{ỹ} exp(S(W, ỹ)) − S(W, y)

where S(W, ỹ) is the probability score of input text W being predicted as a candidate label sequence ỹ, and S(W, y) is the probability score of input text W being predicted as the true label sequence y.
In summary, the technical solution of the present invention provides an inspection business basic rule extraction solution that constructs a professional basic dictionary for the electric power inspection field and a word vector generation model, fully considers the characteristics of business rules in the inspection field and the importance differences between words, and, by converting relations between entities into an entity type, extracts relations directly from text as entities without being limited to known inspection business relations.
The embodiment of the invention effectively solves the problem of extracting business rules in the inspection field and fully considers text semantic information and the importance differences between words. By entity-labeling relations and using the pattern-matching-based electric power inspection business rule triple sequence extraction model, it effectively avoids the limitation of traditional relation extraction methods, in which a classification model must enumerate the relation types in advance, and improves the accuracy of entity relation extraction.
According to the scheme, a professional basic dictionary for the electric power inspection field is established, and in the text cleaning process, cleaning rules tailored to the characteristics and patterns of text corpora in this specific field are adopted: punctuation marks common in official documents (e.g. 《》, parentheses, and hyphens) are retained, and numbers, letters, and letter case are kept unchanged. When calculating word weights, the tfidf values of all non-stop words are computed first, the tfidf value of each stop word is initialized to 0, and then 1 is added to the tfidf value of every word in the text to obtain its weight; the weight of every word is then multiplied by its Word2Vec word embedding representation, fusing semantic information with word importance. Compared with directly computing and weighting the tfidf values of all words, this approach accounts for the influence of stop words by reducing their tfidf value, i.e., their importance, to the minimum, and the added 1 prevents the word vectors of stop words from becoming zero vectors after weighting.
The invention builds a word vector generation model for the electric power inspection field that fuses semantic information and importance degree; given an input text, its word vector matrix can be mapped automatically, a first in the electric power inspection field. Given that descriptions of electric power inspection business rules are expressed in a relatively standard way, exhibit strong regularity, and involve relations that are not fixed in advance, the invention proposes entity-labeling relations, effectively avoiding the limitation of traditional relation extraction methods in which a classification model must enumerate the relation types in advance, and constructs a pattern-matching-based electric power inspection business rule triple sequence extraction model, realizing effective extraction of business rule triples.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. An inspection business basic rule extraction method is characterized by comprising the following steps:
performing text preprocessing on input data to obtain an input cooked corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
constructing a power inspection field word vector generation model; obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved word frequency-inverse document frequency TF-IDF model, and obtaining a word vector matrix for each text;
constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and continuously iterating the model to complete the establishment and parameter tuning of the model;
according to the trained rule extraction model, taking a word vector matrix of a text in the text corpus of the rule to be extracted as input data, and outputting an entity relation label of each word in the text;
and obtaining an inspection business basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label.
2. The method of claim 1, wherein the text preprocessing comprises:
removing empty characters in the input data text;
replacing enumeration commas within sentences and sentence-final periods with blanks;
keeping the common punctuation marks of the official documents, keeping the numbers, the letters and the capital and lower case formats of the letters unchanged, and removing other punctuation marks and special characters;
based on a general dictionary and an independently constructed professional basic dictionary in the inspection field, a Chinese word segmentation tool is used for segmenting words of an input text to obtain an input cooked corpus.
3. The method as claimed in claim 1, wherein the obtaining of the Word vector of each Word in the input corpus text based on Word2vec and the modified Word frequency-inverse document frequency method weighting comprises:
copying and saving a copy of the input cooked corpus D as corpus D';
training a CBOW-based Word2Vec word embedding model using corpus D' to obtain word embedding representations with semantic information, E = {e_1, e_2, …, e_N};
Filtering stop words of the corpus D by using the established domain stop dictionary to remove the stop words in the corpus;
training the TF-IDF model using the stop-word-filtered corpus D to obtain the tfidf value of every word in corpus D:
tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

where tfidf_{i,j} denotes the importance of word t_i for text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the total number of texts containing word t_i;

for a given text d_j in corpus D, the tfidf values of its stop words are initialized to 0; 1 is added to the tfidf values of all words in the text to obtain the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, …, w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
feeding the text d_j into the trained Word2Vec word embedding model to map each of its words to a word embedding representation e_i, i = 1, 2, 3, …, n; multiplying each word embedding by the weight w_{i,j} corresponding to that word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, …, v_{n,j}}

v_{i,j} = w_{i,j} × e_i

where v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word in text d_j; and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j;
and repeating the steps on all texts in the corpus D' to obtain a word vector matrix corresponding to each text.
4. The method as claimed in claim 1, wherein the rule extraction model based on the multi-head self-attention mechanism is constructed as follows:
taking the word vector matrix of the labeled training text corpus as input data, and coding the word vector and the position vector to express the sequence; the position coding adopts a sin and cos calculation mode, the final added result is sent into a simplified Transformer model, and text features are extracted;
sending the text feature extraction result to a conditional random field CRF classifier, and outputting the probability score of the label sequence:
S(W, y) = Σ_i M_{y_i, y_{i+1}} + Σ_i P_{i, y_i}

where W = (w_1, w_2, …, w_n) is the given text; y = (y_1, y_2, …, y_n) is the predicted label sequence; M is the state transition matrix, with M_{y_i, y_{i+1}} denoting the probability of transitioning from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and S(W, y) is the probability score of input text W being predicted as label sequence y;

training the model with the Adam optimization algorithm during back propagation and continuously updating the parameters to finally obtain the label prediction model; preferably, using a negative log-likelihood function as the model loss function, specifically:

Loss = log Σ_{ỹ} exp(S(W, ỹ)) − S(W, y)
5. The method as claimed in claim 1, wherein said continuously iterating the model to complete the model building and parameter tuning using the word vector matrix of the labeled training text corpus as input data comprises:
according to the constructed business rule extraction model, taking a text corpus of rules to be extracted as input data, and outputting an entity relationship label of each word of each text in the text corpus;
acquiring all texts in the text corpus of the rule to be extracted one by one to obtain the corresponding word vector matrix;
and the word vector matrix is used as input data and is sent into a trained rule extraction model based on a multi-head self-attention mechanism for prediction, and the entity relation label of each word in the text is output to obtain the label sequence result of the current text.
6. The method as claimed in claim 1, wherein the obtaining of the inspection business basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label includes:
establishing an inspection service rule triple extraction model based on a rule expression, extracting relation triples in the text in an entity-relation-entity sequence in the tag sequence, outputting an inspection service basic rule triple set corresponding to the text of each rule to be extracted, and generating an inspection service basic rule.
7. An inspection business basic rule extracting device, comprising:
the preprocessing unit is used for performing text preprocessing on input data to obtain an input cooked corpus; the input data includes: common text corpora in the inspection field, labeled training text corpora, and text corpora of rules to be extracted;
the word vector generation unit is used for constructing a word vector generation model for the electric power inspection field; obtaining a word vector for each word in the input cooked corpus text by weighting Word2Vec embeddings with an improved word frequency-inverse document frequency TF-IDF model, and obtaining a word vector matrix for each text;
the business rule learning unit is used for constructing a rule extraction model based on a multi-head self-attention mechanism, using the word vector matrix of the labeled training text corpus as input data, and continuously iterating the model to complete the establishment and parameter tuning of the model;
the entity relationship extraction unit is used for extracting a model according to the trained rule, taking a word vector matrix of one text in the text corpus of the rule to be extracted as input data, and outputting an entity relationship label of each word in the text;
and the inspection service output unit is used for obtaining an inspection service basic rule triple set corresponding to the text of each rule to be extracted according to the entity relationship label.
8. The inspection business basic rule extraction device of claim 7, wherein the preprocessing unit specifically comprises:
a character processing subunit, configured to remove blank characters from the input data text;
a period processing subunit, configured to replace periods within common sentences and periods at the ends of sentences with blanks;
a special character processing subunit, configured to retain common official-document punctuation marks, keep numbers, letters and letter case unchanged, and remove other punctuation marks and special characters;
and a word segmentation processing subunit, configured to segment the input text with a Chinese word segmentation tool based on a general dictionary and a self-constructed professional basic dictionary for the inspection field, to obtain the processed input corpus.
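A rough sketch of these four sub-steps is given below. The exact character classes kept or dropped are assumptions, and real segmentation would use a dictionary-based Chinese segmenter (e.g. jieba loaded with a general dictionary plus the domain dictionary); a whitespace split stands in for it here.

```python
import re

def preprocess(text):
    """Sketch of the cleaning sub-steps; the kept punctuation set is assumed."""
    text = re.sub(r"\s+", " ", text)       # 1. remove/collapse blank characters
    text = re.sub(r"[.。]+", " ", text)     # 2. replace sentence periods with blanks
    # 3. keep common document punctuation, digits and letters (case preserved);
    #    drop other symbols and special characters
    text = re.sub(r"[^\w\s,;:()，；：（）、]", "", text)
    return text.strip()

def segment(text):
    """4. placeholder segmenter; a real system calls a Chinese word
    segmentation tool with general + domain dictionaries."""
    return text.split()
```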
9. The inspection business basic rule extraction device of claim 7, wherein the word vector generation unit specifically comprises:
a stop word processing subunit, configured to copy and store a copy of the processed input corpus D as corpus D', train a CBOW-based Word2Vec word embedding model with corpus D', and obtain a word embedding representation E = {e_1, e_2, ..., e_N} carrying semantic information; and to filter corpus D with the established domain stop-word dictionary to remove the stop words in the corpus;
a model training subunit, configured to train the TF-IDF model with the stop-word-filtered corpus D to obtain the tfidf value of each word in corpus D:

tfidf_{i,j} = tf_{i,j} × idf_i, where tf_{i,j} = n_{i,j} / Σ_k n_{k,j} and idf_i = log( |D| / |{j : t_i ∈ d_j}| )

wherein tfidf_{i,j} denotes the importance of word t_i to text d_j in corpus D; tf_{i,j} denotes the frequency of word t_i in text d_j; idf_i is a measure of the general importance of word t_i; n_{i,j} denotes the number of occurrences of word t_i in text d_j; Σ_k n_{k,j} denotes the total number of occurrences of all words in text d_j; |D| denotes the total number of texts in the corpus; and |{j : t_i ∈ d_j}| denotes the number of texts containing word t_i;
for a certain text d_j in corpus D, the tfidf values of its stop words are initialised to 0; 1 is then added to the tfidf value of every word in the text to obtain the weights of all words in the current text d_j, W_j = {w_{1,j}, w_{2,j}, ..., w_{n,j}}, where w_{i,j} = tfidf_{i,j} + 1 denotes the weight of the i-th word;
the text d_j is input into the trained Word2Vec word embedding model, which maps every word of the text to its word embedding representation e_i, i = 1, 2, 3, ..., n; each word embedding is multiplied by the weight w_{i,j} of the corresponding word to obtain the final word vector matrix of the text:

V_j = {v_{1,j}, v_{2,j}, ..., v_{n,j}}, where v_{i,j} = w_{i,j} × e_i

wherein v_{i,j} denotes the final word vector of the i-th word of text d_j; w_{i,j} denotes the weight of the i-th word of text d_j; and e_i denotes the Word2Vec word embedding representation of the i-th word of text d_j;
and a word vector matrix processing subunit, configured to repeat the above steps for every text in corpus D' to obtain the word vector matrix corresponding to each text.
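The weighting scheme above can be sketched as follows. Toy embeddings stand in for the trained Word2Vec model, and idf is taken as log(|D|/df) since the claim does not spell out a smoothing variant.

```python
import math
import numpy as np

def tfidf_weights(corpus, stopwords=frozenset()):
    """Per-document weights w_{i,j} = tfidf_{i,j} + 1; stop words have
    their tfidf initialised to 0, so they end up with weight 1."""
    N = len(corpus)
    df = {}                                # document frequency of each word
    for doc in corpus:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in corpus:
        w = {}
        for t in set(doc):
            if t in stopwords:
                tfidf = 0.0
            else:
                tf = doc.count(t) / len(doc)       # n_{i,j} / sum_k n_{k,j}
                idf = math.log(N / df[t])          # log(|D| / df), assumed form
                tfidf = tf * idf
            w[t] = tfidf + 1.0
        weights.append(w)
    return weights

def doc_matrix(doc, weights, embed):
    """Final word vectors v_{i,j} = w_{i,j} * e_i, stacked into a matrix."""
    return np.stack([weights[t] * embed[t] for t in doc])
```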
10. The inspection business basic rule extraction device of claim 7, wherein the business rule learning unit specifically comprises:
a text feature extraction subunit, configured to take the word vector matrices of the labeled training text corpora as input data and represent each sequence by encoding its word vectors and position vectors; the positional encoding is calculated with sin and cos functions, and the sum of the two encodings is fed into a simplified Transformer model to extract text features;
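The sin/cos positional encoding can be sketched as below. The claim only states that sin and cos are used; the Vaswani-style 10000^(2i/d) frequencies are an assumption, and the resulting matrix is added to the word vector matrix before the Transformer.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding: sin on even dimensions,
    cos on odd dimensions (assumed standard form)."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model)[None, :]                # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# the encoded input would then be: word_vector_matrix + positional_encoding(n, d)
```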
a probability prediction subunit, configured to send the text feature extraction result to a conditional random field (CRF) classifier and output the probability score of the tag sequence:

s(W, y) = Σ_{i} M_{y_i, y_{i+1}} + Σ_{i} P_{i, y_i}

wherein W = (w_1, w_2, ..., w_n) is the given text; y = (y_1, y_2, ..., y_n) is the predicted tag sequence; M is the state transition matrix; M_{y_i, y_{i+1}} denotes the probability of transferring from label y_i to label y_{i+1}; P_{i, y_i} denotes the probability that the i-th word is labeled y_i; and s(W, y) is the probability score of the input text W being predicted as the tag sequence y;
and a probability obtaining subunit, configured to train the model with the Adam optimization algorithm during back propagation, continuously updating the parameters to finally obtain the tag prediction model; preferably, the negative log-likelihood function is used as the model loss function, specifically:

loss = -log( e^{s(W, y)} / Σ_{y'} e^{s(W, y')} ) = log Σ_{y'} e^{s(W, y')} - s(W, y)
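A brute-force sketch of the CRF score s(W, y) and the negative log-likelihood loss is given below. Here P holds per-word label scores and M the transition scores; start/stop transitions are omitted, and the partition function is enumerated directly, whereas a real CRF layer computes it with the forward algorithm.

```python
import numpy as np
from itertools import product

def score(P, M, y):
    """s(W, y) = sum_i P[i, y_i] + sum_i M[y_i, y_{i+1}]
    (start/stop transitions omitted for brevity)."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(M[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def nll_loss(P, M, y):
    """loss = log sum_{y'} e^{s(W, y')} - s(W, y), enumerating all k^n
    tag sequences y' (feasible only for tiny n and k)."""
    n, k = P.shape
    log_z = np.log(sum(np.exp(score(P, M, list(yp)))
                       for yp in product(range(k), repeat=n)))
    return log_z - score(P, M, y)
```

Minimising this loss pushes the score of the gold tag sequence up relative to all alternatives, which is what the Adam updates in the claim achieve.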
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111179406.7A CN113901218A (en) | 2021-10-08 | 2021-10-08 | Inspection business basic rule extraction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113901218A true CN113901218A (en) | 2022-01-07 |
Family
ID=79190945
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111179406.7A Pending CN113901218A (en) | 2021-10-08 | 2021-10-08 | Inspection business basic rule extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901218A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN114648316A (*) | 2022-05-18 | 2022-06-21 | State Grid Zhejiang Electric Power Co., Ltd. | Digital processing method and system based on inspection tag library
CN114648316B (*) | 2022-05-18 | 2022-08-23 | State Grid Zhejiang Electric Power Co., Ltd. | Digital processing method and system based on inspection tag library
CN117909492A (*) | 2024-03-19 | 2024-04-19 | State Grid Shandong Electric Power Company Information and Telecommunication Company | Method, system, equipment and medium for extracting unstructured information of power grid
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||