CN113326700B - ALBert-based complex heavy equipment entity extraction method - Google Patents
- Publication number: CN113326700B (application CN202110217185.1A)
- Authority
- CN
- China
- Prior art keywords
- albert
- model
- entity
- heavy equipment
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses an ALBert-based complex heavy equipment entity extraction method, implemented according to the following steps: step 1, collecting texts in the field of complex heavy equipment and constructing a corpus; step 2, pre-training the ALBert model with the corpus obtained in step 1 to obtain a pre-trained word representation model; step 3, marking entity names in the corpus obtained in step 1 and adjusting the text format to the format read by the algorithm to obtain a training set and a verification set; step 4, training a model by feeding the labeled data into the ALBert-BGRU-Attention-CRF algorithm to obtain a trained model; step 5, creating a dictionary Dict; and step 6, inputting the text to be extracted into the model obtained in step 4 and combining it with the dictionary Dict constructed in step 5 to obtain the entity extraction result. The invention can complete the entity extraction task in the field of complex heavy equipment.
Description
Technical Field
The invention belongs to the technical field of knowledge graphs, and particularly relates to an ALBert-based complex heavy equipment entity extraction method.
Background
Complex heavy equipment is among the most important basic equipment in the manufacturing industry, an important guarantee for social and economic development and the national defense industry, and, as national heavy equipment, particularly significant. As high-end equipment, it is widely applied in important industries and fields such as energy, transportation, shipbuilding, engineering machinery, metallurgy, aerospace, and the military industry. Heavy equipment has a long development cycle and complex stages, including early investigation, design, manufacture, purchasing, matching, installation, debugging, delivery, quality control, after-sales service, etc. A large amount of knowledge is generated in these processes, most of it stored in text form.
With the development of new internet technologies, effective knowledge management and knowledge reuse in the equipment manufacturing industry can better assist the whole process of design, production, operation and maintenance. The knowledge graph is an efficient mode of knowledge organization and management; entity extraction is one of the important links in knowledge graph construction, and its accuracy determines the accuracy of the knowledge graph to a certain extent. Entity extraction for complex heavy equipment text lays a foundation for subsequent knowledge graph construction, effective knowledge management, and knowledge reuse.
Disclosure of Invention
The invention aims to provide an ALBert-based entity extraction method for complex heavy equipment, which can complete the entity extraction task in the field of complex heavy equipment.
The technical scheme adopted by the invention is an ALBert-based complex heavy equipment entity extraction method, implemented according to the following steps:
Step 1, collecting texts in the field of complex heavy equipment, and constructing a corpus;
Step 2, pre-training the ALBert model using the corpus obtained in step 1 to obtain a pre-trained word representation model ALBert;
Step 3, marking entity names in the corpus obtained in step 1, and adjusting the text format to the format read by the algorithm to obtain a training set and a verification set;
Step 4, training a model, namely feeding the labeled data into the ALBert-BGRU-Attention-CRF algorithm to obtain a trained model;
Step 5, creating a dictionary Dict;
Step 6, inputting the text to be extracted into the model obtained in step 4, and combining it with the dictionary Dict constructed in step 5 to obtain an entity extraction result.
The present invention is also characterized in that,
In step 1, the web crawler framework Scrapy is used to capture relevant complex heavy equipment information from web pages and store it as text files; the stored text is then combined with existing, manually collected documents from the complex heavy equipment field to serve as a data source. The data source is processed to eliminate special symbols, formulas, and measurement units; the processed data serves as the corpus and is stored as text files.
In step 2, the ALBert model takes single Chinese characters as input, with a start identifier [CLS] added before the first character of each sentence and an end identifier [SEP] added at the end of each sentence; for each input character, ALBert outputs a representation vector fused with the text's semantic information. On the basis of the ALBert pre-trained model, the subsequent connection-layer parameters are fine-tuned on the corpus in the data source, while ALBert's internal pre-trained parameters do not participate in training, yielding a fine-tuned ALBert model.
In step 3, entity labeling is completed manually using the BIO labeling scheme: the first character of an entity is marked with a B-Type label, non-first characters of an entity with an I-Type label, and non-entity characters and punctuation marks with an O label, where Type represents the entity category.
The training of the model in step 4 is specifically as follows:
Step 4.1, inputting the training set and the verification set obtained in step 3 into the ALBert model fine-tuned in step 2 to generate word vectors;
Step 4.2, inputting the word vectors generated in step 4.1 into a bidirectional gated recurrent unit (BGRU) to obtain each character's score on every label;
Step 4.3, weighting the result of step 4.2 with an Attention mechanism to obtain each character's weighted score on every label;
Step 4.4, constraining the label sequence with a conditional random field (CRF) to reduce the probability of invalid sequences;
Step 4.5, obtaining the trained entity extraction model.
Step 5 is specifically as follows:
Relevant names, including but not limited to part, combination, and product names, are extracted from the complex heavy equipment detailed information table to form the dictionary Dict.
Step 6 is specifically as follows:
Step 6.1, for large volumes of text to be extracted, all texts are imported into the entity extraction model trained in step 4 to obtain a primary recognition result; on this basis, the dictionary Dict constructed in step 5 is applied for secondary extraction to obtain the final entity extraction result;
Step 6.2, for entity extraction from individual sentences, online recognition is used: the sentence to be extracted is pasted into the online recognition window, the model obtained in step 4 is invoked, and the extraction result is given in combination with the dictionary Dict.
The beneficial effects of the method are as follows. The ALBert-based complex heavy equipment entity extraction method labels text from existing documents and web pages in the field as a corpus, realizes word embedding with the fine-tuned ALBert, and trains an entity extraction model with the deep learning algorithm BGRU-Attention-CRF; to improve extraction accuracy while accounting for the special terminology of the complex heavy equipment industry, a domain dictionary is added. When a new corpus is input, the trained model identifies the entities in it and, combined with the dictionary, gives the final entity extraction result.
Drawings
FIG. 1 is the general flow chart of the ALBert-based complex heavy equipment entity extraction method of the present invention;
FIG. 2 is a flow chart of the deep learning algorithm ALBert-BGRU-Attention-CRF used to establish the entity extraction model in the ALBert-based complex heavy equipment entity extraction method.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
In the ALBert-based complex heavy equipment entity extraction method, whose flow chart is shown in FIG. 1, an entity extraction model is trained with the deep-learning-based ALBert-BGRU-Attention-CRF algorithm on top of data collection and processing; the corpus to be extracted undergoes primary entity extraction, which is combined with a dictionary (Dict) to obtain the final extraction result. The method is implemented according to the following steps:
Step 1, collecting texts in the field of complex heavy equipment, and constructing a corpus;
In step 1, the web crawler framework Scrapy is used to capture relevant complex heavy equipment information from web pages and store it as text files; the stored text is then combined with existing, manually collected documents from the complex heavy equipment field to serve as a data source. The data source is processed to eliminate special symbols, formulas, and measurement units; the processed data serves as the corpus and is stored as text files.
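The cleaning pass in step 1 can be sketched with a few regular expressions. This is a minimal illustration, not the patent's implementation: the function name, the symbol set, and the list of measurement units are all assumptions for demonstration.

```python
import re

# Hypothetical unit list; the patent does not enumerate which units it removes.
MEASUREMENT_UNITS = r"(?:mm|cm|MPa|kN|t/h|r/min)"

def clean_corpus_line(line: str) -> str:
    """Remove special symbols, inline formulas, and measurement units from one line."""
    line = re.sub(r"\$[^$]*\$", "", line)                             # drop inline formulas like $F = ma$
    line = re.sub(r"\d+(?:\.\d+)?\s*" + MEASUREMENT_UNITS, "", line)  # drop readings like "200 MPa"
    line = re.sub(r"[#*@^~|\\]", "", line)                            # drop special symbols
    return re.sub(r"\s+", " ", line).strip()                          # normalize whitespace

print(clean_corpus_line("The press applies $F = ma$ up to 200 MPa of force #test"))
```

Each cleaned line would then be appended to the corpus text file described above.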
Step 2, pre-training the ALBert model using the corpus obtained in step 1 to obtain a pre-trained word representation model ALBert;
In step 2, the ALBert model takes single Chinese characters as input, with a start identifier [CLS] added before the first character of each sentence and an end identifier [SEP] added at the end of each sentence; for each input character, ALBert outputs a representation vector fused with the text's semantic information. On the basis of the ALBert pre-trained model, the subsequent connection-layer parameters are fine-tuned on the corpus in the data source, while ALBert's internal pre-trained parameters do not participate in training, yielding a fine-tuned ALBert model.
Step 3, marking entity names in the corpus obtained in step 1, and adjusting the text format to the format read by the algorithm to obtain a training set and a verification set;
In step 3, entity labeling is completed manually using the BIO labeling scheme: the first character of an entity is marked with a B-Type label, non-first characters of an entity with an I-Type label, and non-entity characters and punctuation marks with an O label, where Type represents the entity category.
For step 3, a web page system for manual annotation and automatic format adjustment was developed. Entity labeling is completed manually using this data labeling web page.
The pseudocode of the entity labeling and text format adjustment algorithm is as follows:
Input: text data to be labeled;
Output: labeled data with tags;
1. Text preprocessing:
1.1. Remove line feeds and blank spaces in the text, and display the formatted text;
1.2. Create a tag array and initialize the label of every character in the text to O;
2. Label entities:
2.1. Click a label type and select the entity corresponding to it; the labels of the selected entity's characters are set to that label type;
2.2. If full-text labeling is enabled, search the full text and set the labels of all entities with the same name to the selected label type;
3. Generate annotation data in the standard format: output the text character by character, appending to each character its corresponding label and a line-feed character;
Return the format-standardized, labeled data.
The labeling uses the BIO scheme: the first character of an entity is marked with a B-Type label, non-first characters of an entity with an I-Type label, and non-entity characters and punctuation marks with an O label, where Type represents the entity category.
For example, given the corpus sentence "金属挤压机是实现金属挤压加工最主要的设备。" ("The metal extrusion press is the most important equipment for realizing metal extrusion processing."), the entities are labeled character by character: 金 B-Product 属 I-Product 挤 I-Product 压 I-Product 机 I-Product 是 O 实 O 现 O 金 B-Way 属 I-Way 挤 I-Way 压 I-Way 加 I-Way 工 I-Way 最 O 主 O 要 O 的 O 设 O 备 O 。 O
Non-entity characters are marked "O"; "B-Product" marks the first character of a "Product" entity, "I-Product" marks non-first characters of a "Product" entity, "B-Way" marks the first character of a "processing mode" entity, and "I-Way" marks non-first characters of a "processing mode" entity.
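The labeling scheme above can be sketched as a small function that, given a sentence and its known entity spans, emits one (character, label) pair per character, matching the pseudocode's "initialize every label to O, then overwrite entity spans" flow. The function and its span-based input format are assumptions for illustration; the shortened example sentence follows the metal-extrusion example above.

```python
def bio_label(sentence, entities):
    """entities: list of (start, end, type) character spans, end exclusive."""
    labels = ["O"] * len(sentence)            # step 1.2: initialize every label to O
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"          # first character of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"          # non-first characters of the entity
    return list(zip(sentence, labels))

# "金属挤压机" (metal extrusion press) labeled as a Product entity:
pairs = bio_label("金属挤压机是设备", [(0, 5, "Product")])
print(pairs)
```

Writing each pair on its own line, character then label then a line feed, produces exactly the standard annotation format described in step 3 of the pseudocode.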
Step 4, training a model, namely feeding the labeled data into the ALBert-BGRU-Attention-CRF algorithm to obtain a trained model; the flow chart is shown in FIG. 2.
The training of the model in step 4 is specifically as follows:
Step 4.1, inputting the training set and the verification set obtained in step 3 into the ALBert model fine-tuned in step 2 to generate word vectors;
Step 4.2, inputting the word vectors generated in step 4.1 into a bidirectional gated recurrent unit (BGRU) to obtain each character's score on every label;
Step 4.3, weighting the result of step 4.2 with an Attention mechanism to obtain each character's weighted score on every label;
Step 4.4, constraining the label sequence with a conditional random field (CRF) to reduce the probability of invalid sequences;
Step 4.5, obtaining the trained entity extraction model.
The pseudocode for training the entity extraction model is as follows:
Input: training set and verification set;
Output: entity extraction model;
1. Import the training set and the validation set;
2. Import the fine-tuned ALBert model;
3. Feed the word vectors into GRU-Attention-CRF;
4. Specify the model parameters;
5. Input the training set and the verification set and start training;
Return the entity extraction model.
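The CRF constraint in step 4.4 can be illustrated with a tiny Viterbi decode over per-character tag scores (standing in for the BGRU-Attention output), where invalid transitions such as O followed by I-Product, or a sentence starting with I-Product, are forbidden. All scores below are toy numbers, not trained parameters, and the two-tag-type setup is an assumption for illustration.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-Product", "I-Product"]

def allowed(prev, cur):
    # An I-Type label may only follow B-Type or I-Type of the same type.
    if cur.startswith("I-"):
        return prev in (cur, "B-" + cur[2:])
    return True

def viterbi(emissions):
    """emissions: list of {tag: score} per character; returns the best valid tag sequence."""
    # A sentence may not start with an I- tag.
    best = [{t: ((emissions[0][t] if not t.startswith("I-") else NEG_INF), [t])
             for t in TAGS}]
    for em in emissions[1:]:
        step = {}
        for cur in TAGS:
            score, path = max(
                ((best[-1][p][0] + em[cur]) if allowed(p, cur) else NEG_INF,
                 best[-1][p][1] + [cur])
                for p in TAGS
            )
            step[cur] = (score, path)
        best.append(step)
    return max(best[-1].values())[1]

# Even though I-Product scores highest at position 0, the constraint forces a valid start:
ems = [{"O": 0.1, "B-Product": 0.3, "I-Product": 0.9},
       {"O": 0.2, "B-Product": 0.1, "I-Product": 0.8}]
print(viterbi(ems))  # ['B-Product', 'I-Product']
```

This is exactly the sense in which the CRF "reduces the probability of invalid sequences": sequences that violate the BIO grammar receive an effectively infinite penalty and can never be decoded.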
Step 5, creating a dictionary Dict;
Step 5 is specifically as follows:
Relevant names, including but not limited to part, combination, and product names, are extracted from the complex heavy equipment detailed information table to form the dictionary Dict.
Step 6, inputting the text to be extracted into the model obtained in step 4, and combining it with the dictionary Dict constructed in step 5 to obtain an entity extraction result.
Step 6 is specifically as follows:
Step 6.1, for large volumes of text to be extracted, all texts are imported into the entity extraction model trained in step 4 to obtain a primary recognition result; on this basis, the dictionary Dict constructed in step 5 is applied for secondary extraction to obtain the final entity extraction result;
Step 6.2, for entity extraction from individual sentences, online recognition is used: the sentence to be extracted is pasted into the online recognition window, the model obtained in step 4 is invoked, and the extraction result is given in combination with the dictionary Dict.
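The secondary extraction in step 6.1 can be sketched as a merge of the model's first-pass entities with exact-match hits from the domain dictionary Dict. The dictionary entries, the example sentence, and the merge policy (dictionary matches are added only where they do not overlap a model entity) are assumptions for illustration; the patent only states that Dict is combined with the model output.

```python
# Hypothetical Dict entries: 挤压筒 (extrusion container, a Part),
# 金属挤压机 (metal extrusion press, a Product).
DICT = {"挤压筒": "Part", "金属挤压机": "Product"}

def secondary_extract(text, model_entities):
    """model_entities: list of (start, end, type) spans; returns the merged, sorted list."""
    merged = list(model_entities)
    for name, etype in DICT.items():
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            # Keep the model's prediction when spans overlap.
            if not any(s < end and start < e for s, e, _ in merged):
                merged.append((start, end, etype))
            start = text.find(name, end)
    return sorted(merged)

# The model found only the Product entity; the dictionary recovers the Part it missed.
text = "金属挤压机的挤压筒需要定期维护"
print(secondary_extract(text, [(0, 5, "Product")]))
```

This is where the domain dictionary earns its place: industry-specific nouns that the trained model misses in the primary pass are still recovered in the final result.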
Claims (3)
1. An ALBert-based complex heavy equipment entity extraction method, characterized by comprising the following steps:
Step 1, collecting texts in the field of complex heavy equipment, and constructing a corpus;
In step 1, the web crawler framework Scrapy is used to capture relevant complex heavy equipment information from web pages and store it as text files; the stored text is then combined with existing, manually collected documents from the complex heavy equipment field to serve as a data source; the data source is processed to eliminate special symbols, formulas, and measurement units; the processed data serves as the corpus and is stored as text files;
Step 2, pre-training the ALBert model using the corpus obtained in step 1 to obtain a pre-trained word representation model ALBert;
In step 2, the ALBert model takes single Chinese characters as input, with a start identifier [CLS] added before the first character of each sentence and an end identifier [SEP] added at the end of each sentence; for each input character, ALBert outputs a representation vector fused with the text's semantic information; on the basis of the ALBert pre-trained model, the subsequent connection-layer parameters are fine-tuned on the corpus in the data source, while ALBert's internal pre-trained parameters do not participate in training, yielding a fine-tuned ALBert model;
Step 3, marking entity names in the corpus obtained in step 1, and adjusting the text format to the format read by the algorithm to obtain a training set and a verification set;
In step 3, entity labeling is completed manually using the BIO labeling scheme: the first character of an entity is marked with a B-Type label, non-first characters of an entity with an I-Type label, and non-entity characters and punctuation marks with an O label, where Type represents the entity category;
Step 4, training a model, namely feeding the labeled data into the ALBert-BGRU-Attention-CRF algorithm to obtain a trained model;
The training of the model in step 4 is specifically as follows:
Step 4.1, inputting the training set and the verification set obtained in step 3 into the ALBert model fine-tuned in step 2 to generate word vectors;
Step 4.2, inputting the word vectors generated in step 4.1 into a bidirectional gated recurrent unit (BGRU) to obtain each character's score on every label;
Step 4.3, weighting the result of step 4.2 with an Attention mechanism to obtain each character's weighted score on every label;
Step 4.4, constraining the label sequence with a conditional random field (CRF) to reduce the probability of invalid sequences;
Step 4.5, obtaining the trained entity extraction model;
Step 5, creating a dictionary Dict;
Step 6, inputting the text to be extracted into the model obtained in step 4, and combining it with the dictionary Dict constructed in step 5 to obtain an entity extraction result.
2. The ALBert-based complex heavy equipment entity extraction method as set forth in claim 1, wherein step 5 is specifically as follows:
Relevant names, including part, combination, and product names, are extracted from the complex heavy equipment detailed information table to form the dictionary Dict.
3. The ALBert-based complex heavy equipment entity extraction method as claimed in claim 2, wherein step 6 is specifically as follows:
Step 6.1, for large volumes of text to be extracted, all texts are imported into the entity extraction model trained in step 4 to obtain a primary recognition result; on this basis, the dictionary Dict constructed in step 5 is applied for secondary extraction to obtain the final entity extraction result;
Step 6.2, for entity extraction from individual sentences, online recognition is used: the sentence to be extracted is pasted into the online recognition window, the model obtained in step 4 is invoked, and the extraction result is given in combination with the dictionary Dict.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110217185.1A CN113326700B (en) | 2021-02-26 | 2021-02-26 | ALBert-based complex heavy equipment entity extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113326700A CN113326700A (en) | 2021-08-31 |
CN113326700B (en) | 2024-05-14
Family
ID=77414448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110217185.1A Active CN113326700B (en) | 2021-02-26 | 2021-02-26 | ALBert-based complex heavy equipment entity extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113326700B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106815293A (en) * | 2016-12-08 | 2017-06-09 | 中国电子科技集团公司第三十二研究所 | System and method for constructing knowledge graph for information analysis |
CN109359293A (en) * | 2018-09-13 | 2019-02-19 | 内蒙古大学 | Neural-network-based Mongolian named entity recognition method and recognition system |
CN110188347A (en) * | 2019-04-29 | 2019-08-30 | 西安交通大学 | Text-oriented method for recognizing and extracting relations between knowledge topics |
CN110598203A (en) * | 2019-07-19 | 2019-12-20 | 中国人民解放军国防科技大学 | Military scenario document entity information extraction method and device combined with a dictionary |
CN110990525A (en) * | 2019-11-15 | 2020-04-10 | 华融融通(北京)科技有限公司 | Natural language processing-based public opinion information extraction and knowledge base generation method |
CN111199152A (en) * | 2019-12-20 | 2020-05-26 | 西安交通大学 | Named entity identification method based on label attention mechanism |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111860882A (en) * | 2020-06-17 | 2020-10-30 | 国网江苏省电力有限公司 | Method and device for constructing power grid dispatching fault processing knowledge graph |
CN111950540A (en) * | 2020-07-24 | 2020-11-17 | 浙江师范大学 | Knowledge point extraction method, system, device and medium based on deep learning |
CN112036185A (en) * | 2020-11-04 | 2020-12-04 | 长沙树根互联技术有限公司 | Method and device for constructing named entity recognition model based on industrial enterprise |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10049103B2 (en) * | 2017-01-17 | 2018-08-14 | Xerox Corporation | Author personality trait recognition from short texts with a deep compositional learning approach |
Non-Patent Citations (1)
Title |
---|
结合ALBERT和双向门控循环单元的专利文本分类 ("Patent text classification combining ALBERT and bidirectional gated recurrent units"); Wen Chaodong et al.; Journal of Computer Applications (《计算机应用》); 2021-02-10; Vol. 41, No. 2; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN113326700A (en) | 2021-08-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||