CN116562295A

CN116562295A - Method for identifying enhanced semantic named entity for text in bridge field

Info

Publication number: CN116562295A
Application number: CN202310599704.4A
Authority: CN
Inventors: 张永涛; 田唯; 黄灿; 朱浩; 徐双双; 王永威; 肖垚; 李焜耀; 陈圆; 杨华东; 薛现凯; 刘志昂
Original assignee: CCCC Second Harbor Engineering Co; CCCC Highway Long Bridge Construction National Engineering Research Center Co Ltd
Current assignee: CCCC Second Harbor Engineering Co; CCCC Highway Long Bridge Construction National Engineering Research Center Co Ltd
Priority date: 2023-05-25
Filing date: 2023-05-25
Publication date: 2023-08-08

Abstract

The invention discloses a method for identifying enhanced semantic named entities for text in the bridge field, and relates in particular to the technical fields of natural language processing, deep learning and artificial intelligence. The specific implementation scheme is as follows: acquiring a document set containing the entities to be identified in the bridge engineering field and parsing corpus text data containing the entities to be identified from the selected document set; labeling a small amount of the data and dividing the labeled samples into a training set, a verification set and a test set; training the double-tower model on the training set, and then verifying and tuning it on the verification set to obtain an optimal model; testing the verified optimal double-tower model on the test set; and carrying out named entity recognition on the text to be recognized in the bridge engineering database by using the verified double-tower model. The method can accurately identify complex, domain-specific named entities in the bridge field in few-sample scenarios, without requiring a large amount of labeled data.

Description

Method for identifying enhanced semantic named entity for text in bridge field

Technical Field

The invention relates to the technical field of natural language processing. More particularly, the invention relates to a method for identifying enhanced semantic named entities for text in the bridge field.

Background

With the rapid development of artificial intelligence technology, the demand for processing natural language text data keeps growing, and extracting valuable semantic information from text data has always been one of the key research tasks in the field of natural language processing.

The field of bridge engineering has a large number of texts related to bridge knowledge, which can be divided into structured, semi-structured and unstructured data according to how the data is organized. Unstructured data, which accounts for the largest proportion, contains a large amount of bridge engineering information. The entities contained in these texts can be identified through a named entity recognition task, which provides knowledge support for constructing knowledge graphs in the bridge field and important basic data for upper-layer application tasks in the bridge field, such as content auditing and text generation.

At present, a relatively advanced named entity recognition method in the industry is the bidirectional long short-term memory network combined with a conditional random field (BiLSTM-CRF). This method takes character embeddings and phrase embeddings as input and builds the entity recognition model from a bidirectional long short-term memory network and a conditional random field model. However, the method faces the following challenges in the field of bridge engineering: (1) labeled data in the bridge field is very scarce, while the method requires a large amount of labeled data, which reduces its feasibility; (2) data in the bridge field is highly specialized and has numerous entity types, and the method suffers from low recognition accuracy when the number of entity types is large. Therefore, a method that accurately identifies named entities in the bridge field under few-sample conditions is urgently needed.

Disclosure of Invention

The invention aims to provide a method for identifying a text-oriented enhanced semantic named entity in the bridge field, which can accurately identify a complex specific named entity in the bridge field under the condition of few samples and without a large amount of labeled data.

To achieve these objects and other advantages of the invention, a method for identifying enhanced semantic named entities for text in the bridge field is provided, comprising the following steps:

step S1: preparing a document set of an entity to be identified in the field of bridge engineering;

step S2: analyzing corpus text data of the entity to be identified in the selected document set in the step S1 through a document analysis module;

step S3: selecting corpus text data in the step S2 with a set proportion for marking, and dividing marked samples into a training set, a verification set and a test set;

step S8: training the double-tower model through a training set, and verifying and adjusting the double-tower model to be an optimal model through a verification set after training is completed;

step S9: testing the verified optimal double-tower model through a test set until the optimal double-tower model meets design standards;

step S10: carrying out named entity recognition on the text to be recognized in the bridge engineering database by using the verified double-tower model.

Preferably, the step S3 specifically includes the following sub-steps:

step S301: data cleaning, namely cleaning the corpus text data parsed in step S2, specifically removing garbled characters, duplicate text data and abnormal symbols;

step S302: designing and defining entity categories, and customizing multiple entity types and corresponding labels according to design requirements;

step S303: entity labeling, namely manually labeling part of corpus text data in the step S2 according to multi-class entity types defined in the step S302 and expert experience, and adopting a BIOS labeling mode with word granularity;

step S304: data set division, namely dividing the small number of labeled samples into a training set, a verification set and a test set.
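The cleaning in step S301 can be illustrated with a minimal sketch. The patent does not define which characters count as garbled or abnormal, so the symbol set ABNORMAL_SYMBOLS, the treatment of Unicode control characters as garbled text, and the function name clean_corpus are all assumptions made for illustration only.

```python
import re
import unicodedata

# Hypothetical set of "abnormal symbols"; the patent does not enumerate them.
# U+FFFD is the Unicode replacement character that typically marks decoding errors.
ABNORMAL_SYMBOLS = set("■□◆▲●★\ufffd")


def clean_corpus(lines):
    """Drop garbled/abnormal characters and duplicate lines, keeping first occurrences."""
    seen = set()
    cleaned = []
    for line in lines:
        # Collapse every whitespace run (tabs, newlines, ...) into a single space.
        text = re.sub(r"\s+", " ", line)
        # Remove control/format/unassigned characters (Unicode categories starting
        # with "C") and the assumed abnormal symbols.
        text = "".join(
            ch for ch in text
            if unicodedata.category(ch)[0] != "C" and ch not in ABNORMAL_SYMBOLS
        ).strip()
        # Skip lines that became empty and lines already seen (duplicates).
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Duplicates are detected after cleaning, so two lines that differ only by garbled characters are treated as the same text.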

Preferably, the entity types in step S302 include a related person name, a bridge project related place name, a bridge project related industry name, a related organization name, a date, a compiling basis type name, a compiling basis number and a bridge field professional vocabulary, and the corresponding labels are: PER, LOC, IND, ORG, DATE, CATE, NUM and TER.

Preferably, the enhanced semantic named entity recognition method further comprises the following steps:

step S4: expanding the tag label through the tag mode and generating a tag mode characterization matrix b;

step S5: expanding the label through the sentence pattern and generating a sentence pattern characterization matrix c;

step S6: adding the two characterization matrixes b and c obtained in the step S4 and the step S5 to obtain a comprehensive characterization matrix d of the tag;

step S7: performing a word segmentation operation on the input corpus text data from step S2 to obtain a list of tokens (word elements), traversing each token in the list, and inputting each token into an encoder named BERT document encoder to obtain the characterization vector e of each token;

step S8: multiplying the characterization vector e of each token by the comprehensive label characterization matrix d obtained in step S6, performing a softmax operation, taking the label with the maximum probability for the token, and starting to train the double-tower model.

Preferably, the step S4 specifically includes the following sub-steps:

step S401: expanding the abbreviation of the label into an English natural language representation form through a matching relation;

step S402: expanding the English obtained in the step S401 by combining the BIOS labeling mode with the word granularity, and further generating a complete English natural language expression form;

step S403: the natural language form expanded in BIOS mode in step S402 is input into an encoder named BERT label encoder for encoding, and the [CLS] token embedding of BERT is used as the characterization of each label; these characterizations are combined into the tag mode characterization matrix b.

Preferably, the step S5 specifically includes the following sub-steps:

step S501: expanding the abbreviation of the label into an English natural language representation form through a matching relation;

step S502: matching the natural language representation form obtained in step S501 against the corpus text data from step S2 through a matching module, returning the text containing the form obtained in step S501 if the matching succeeds, and returning empty if it fails;

step S503: the text obtained in step S502 is input into an encoder named BERT sentence encoder for encoding, and the [CLS] token embedding of BERT is used as the sentence characterization of each label; these characterizations are combined into the sentence pattern characterization matrix c.

Preferably, the step S9 specifically includes the following sub-steps:

step S901: sending the test set data to the trained and verified enhanced semantic double-tower model, which predicts the corresponding label for each input sentence sequence;

step S902: according to the real labels in the test data set, counting the number of entities corresponding to the real labels in the labels predicted by the model, the total number of entities predicted by the model and the total number of entities in the data set;

step S903: the precision P (Precision), the recall R (Recall) and the comprehensive evaluation index F1 value are calculated according to the following formulas:

Precision = number of entities correctly predicted by the model / total number of entities predicted by the model;

Recall = number of entities correctly predicted by the model / total number of entities in the data set;

F1 = 2*(Precision*Recall)/(Precision+Recall);

and testing is carried out until the three indexes meet the design standard.
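The three indexes of step S903 can be computed as in the following minimal sketch. It assumes that each entity is represented as a hashable tuple such as (sentence index, start, end, label) and that a prediction counts as correct only on an exact match; the patent does not state the matching criterion, and the function name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level precision, recall and F1 under exact-match counting (assumed)."""
    gold = set(gold_entities)
    predicted = set(predicted_entities)
    # Entities that the model predicted and that also appear among the real labels.
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

With 4 real entities, 3 predicted entities and 2 exact matches, this gives Precision = 2/3, Recall = 1/2 and F1 = 4/7.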

Preferably, the set proportion of the corpus text data from step S2 that is selected in step S3 is 5%-10%.

The invention at least comprises the following beneficial effects:

To address the lack of high-quality annotated data in the bridge field, 8 classes of bridge-related named entities are customized, a corpus for named entity recognition in the bridge field is constructed through a small amount of manual annotation, and a two-stage double-tower model with enhanced semantics is designed. In the first stage, a tag mode characterization matrix and a sentence pattern characterization matrix are generated: the conventional BIOS tag mode and the specific entity category names are expanded, and the tag mode characterization matrix corresponding to the labels is generated through BERT; the full name of each label is matched against the corpus data, and the sentence pattern characterization matrix corresponding to the labels is generated through BERT. In the second stage, the characterization vector of each token of an input sentence is obtained through BERT, and named entity recognition is then performed on the text to be recognized using the comprehensive characterization matrix obtained in the first stage and the characterization vectors obtained in the second stage. By expanding text semantic information, mainly through two semantic enhancement modes (label-enhanced semantics and text-enhanced semantics), and by modeling with the two-stage double-tower model, the method can effectively improve named entity recognition in the bridge field under few-sample constraints with only a small amount of labeled data.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.

Drawings

FIG. 1 is a schematic flow chart of the present invention;

FIG. 2 is a flowchart of the algorithm corresponding to steps S4 to S8 of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the drawings to enable those skilled in the art to practice the invention by referring to the description.

It should be noted that the experimental methods described in the following embodiments, unless otherwise specified, are all conventional methods, and the reagents and materials, unless otherwise specified, are all commercially available; in the description of the present invention, the terms "transverse", "longitudinal", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus are not to be construed as limiting the present invention.

Examples

The invention provides a method for identifying enhanced semantic named entities for text in the bridge field, which fuses the tag mode semantic features and the tag sentence semantic features of the expanded label text. The flow is shown in FIG. 1 and specifically comprises the following steps:

Step S1: preparing a document set of entities to be identified in the bridge engineering field. Documents in the bridge engineering field are diverse: by format they include word, pdf and txt files; by bridge structure they include documents on steel structures, cast-in-place piles, open caissons, foundation pits, piers, arch bridges and the like.

Step S2: parsing the corpus to be identified from the various documents in the document set selected in step S1. A document analysis module parses the documents and generates corpus text data containing the entities to be identified. The objects parsed by the analysis module include various kinds of unstructured data such as tables, text and pdf, and the generated corpus text data is in string format, consisting of a plurality of characters.

Step S3: preparing a small amount of labeled data. Part of the obtained corpus is manually labeled, drawing on expert experience in the bridge field and using the BIOS labeling mode with word granularity. The labeled entity categories include person names, place names, organization names, industry names, specific professional terms and other categories, customized according to actual conditions. The small amount of labeled data is generally 5%-10% of the corpus text data of the entities to be identified. A named entity recognition data set for text in the bridge field is thus constructed and divided into a training set, a verification set and a test set.

In an alternative embodiment of the invention, in step S3, the following sub-steps are included:

Step S301: data cleaning, namely performing a data cleaning operation on the bridge field text parsed in step S2, specifically removing garbled characters, duplicate text data and abnormal symbols;

Step S302: designing and defining entity categories. In the actual scenario of this application, 8 classes of entities are customized: related person name (PER), bridge project related place name (LOC), bridge project related industry name (IND), related organization name (ORG), date (DATE), compiling basis category name (CATE), compiling basis number (NUM) and bridge field professional vocabulary (TER). The labeling specification for named entities in the bridge field is shown in Table 1:

Table 1 Labeling specification for named entities in the bridge field

Step S303: entity labeling, namely manually labeling a small part of the bridge field text according to the 8 classes of entities defined in Table 1 in step S302, in combination with expert experience, using the BIOS labeling mode with word granularity.

Step S304: data set division, namely dividing the small number of labeled samples into a training set, a verification set and a test set, in a ratio of approximately 7:1:2.
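The division in step S304 can be sketched as follows. The shuffling, the fixed random seed and the function name split_dataset are assumptions for illustration; the patent only states that the three sets are in a ratio of approximately 7:1:2.

```python
import random


def split_dataset(samples, ratios=(7, 1, 2), seed=42):
    """Shuffle labeled samples and split them into training/verification/test sets."""
    items = list(samples)
    # A seeded generator keeps the split reproducible (the seed value is arbitrary).
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * ratios[0] // total
    n_verify = len(items) * ratios[1] // total
    train = items[:n_train]
    verify = items[n_train:n_train + n_verify]
    # The test set takes the remainder, so no sample is lost to rounding.
    test = items[n_train + n_verify:]
    return train, verify, test
```

For 100 labeled sentences this yields 70, 10 and 20 samples respectively.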

Step S4: expanding the English abbreviation of each label into a natural language form, for example expanding the person name label 'PER' into 'Person' and the industry label 'IND' into 'Industry'; expanding the BIOS mode of each label into a natural language form, for example expanding 'B-PER' into 'begin Person' and 'I-PER' into 'inside Person'; in this way the labels are expanded through the tag mode and a tag mode characterization matrix is generated.
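The two expansions of step S4 can be sketched as a pair of lookup tables. Only 'PER' to 'Person', 'IND' to 'Industry', 'B' to 'begin' and 'I' to 'inside' are stated in the text; every other entry below (including the words chosen for 'S' and 'O') is an assumption made for illustration, since Table 2 is not reproduced here.

```python
# Abbreviation -> natural language name. Only PER and IND come from the text;
# the remaining names are assumed.
LABEL_NAMES = {
    "PER": "Person",
    "IND": "Industry",
    "LOC": "Location",       # assumed
    "ORG": "Organization",   # assumed
    "DATE": "Date",          # assumed
    "CATE": "Category",      # assumed
    "NUM": "Number",         # assumed
    "TER": "Term",           # assumed
}

# BIOS prefix -> natural language word. "begin" and "inside" come from the text;
# "single" for S is assumed.
PREFIX_WORDS = {"B": "begin", "I": "inside", "S": "single"}


def expand_tag(tag):
    """Expand a BIOS tag such as 'B-PER' into a natural language form."""
    if tag == "O":
        return "other"  # assumed wording for the non-entity tag
    prefix, label = tag.split("-", 1)
    return f"{PREFIX_WORDS[prefix]} {LABEL_NAMES[label]}"
```

For example, expand_tag("B-PER") gives "begin Person" and expand_tag("I-PER") gives "inside Person", matching the examples in step S4.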

In an alternative embodiment of the present invention, the basic process of expanding the labels and generating the corresponding characterization matrix in step S4 comprises the following sub-steps:

Step S401: expanding the abbreviation of each label into a complete English natural language representation form through a matching relation; the expanded form of each label is given in the second column of Table 2;

Step S402: expanding the English obtained in step S401 in combination with the BIOS labeling mode with word granularity to generate a complete English natural language representation form; the form obtained after this expansion is given in the third column of Table 2;

Table 2 Labels and their expanded forms

Step S403: inputting the natural language form expanded in BIOS mode in step S402 into the encoder named BERT label encoder and, after encoding, using the [CLS] token of BERT as the label characterization, denoted b_i, i ∈ {1, 2, ..., 2*N-1}, where N is the number of tag categories. For all labels, the characterizations b_i are combined into the tag mode characterization matrix, denoted b.

Step S5: and expanding the label name through the sentence pattern and generating a sentence pattern characterization matrix.

In an alternative embodiment of the present invention, the basic process of expanding the labels and generating the corresponding characterization matrix in step S5 comprises the following sub-steps:

step S501: expanding the abbreviation of the label into a complete English natural language expression form through a matching relation, and giving a form after label expansion in a second column of the table 2;

step S502: matching the natural language representation form obtained in step S501 against the corpus text data parsed in step S2 through a matching module, returning the text containing the form obtained in step S501 if the matching succeeds, and returning empty if it fails; for example, "input" successfully matches a sentence in the corpus, and "current situation and development trend of bridge industry" is returned;

step S503: inputting the text obtained in step S502 into an encoder named BERT sentence encoder and, after encoding, using the [CLS] token embedding of BERT as the sentence characterization of each label, denoted c_i, i ∈ {1, 2, ..., 2*N-1}, where N is the number of tag categories. For all labels, the characterizations c_i are combined into the sentence pattern characterization matrix, denoted c.
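The behaviour of the matching module in step S502 can be sketched as a simple search. The patent does not specify how matching is performed, so the case-insensitive substring test, the choice to return the first matching sentence, and the function name match_label_sentence are assumptions.

```python
def match_label_sentence(label_form, corpus_sentences):
    """Return a corpus sentence containing the expanded label form, or "" if none."""
    needle = label_form.lower()
    for sentence in corpus_sentences:
        # Assumed criterion: case-insensitive substring containment.
        if needle in sentence.lower():
            return sentence
    # Matching failed: return empty, as described in step S502.
    return ""
```

The returned sentence (or the empty string) is what would then be passed to the BERT sentence encoder in step S503.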

Step S6: adding the two matrices b and c obtained in step S4 and step S5 to obtain the comprehensive characterization matrix of the labels, denoted d.

Step S7: performing a word segmentation operation on the input sentence to obtain a list of tokens (word elements), traversing each token in the list, and inputting each token into an encoder named BERT document encoder to obtain the characterization vector e of each token.

Step S8: to obtain the label with the maximum probability for each token, the characterization vector e of each token is multiplied by the comprehensive label characterization matrix d obtained in step S6, a softmax operation is applied, and the label with the maximum probability is taken. The maximum-probability label of each token is calculated as:

y = argmax_i softmax(e·d);

After the calculation formula corresponding to the objective function has been constructed, training of the double-tower model begins; during training, the model is trained by minimizing the sentence-level negative log-likelihood. The verification set data is used for hyperparameter tuning during training and for checking the state and convergence of the model and whether it is over-fitted.
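Steps S6 to S8 can be illustrated numerically with a small pure-Python sketch that uses plain lists in place of BERT outputs. It assumes that d stores one row per label, so that e·d means the dot product of e with each row, and that the sentence-level negative log-likelihood is the sum of the per-token terms; neither detail is spelled out in the text, and all function names are hypothetical.

```python
import math


def add_matrices(b, c):
    """Step S6: element-wise sum of the tag mode matrix b and sentence pattern matrix c."""
    return [[x + y for x, y in zip(row_b, row_c)] for row_b, row_c in zip(b, c)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def label_probabilities(e, d):
    """softmax(e·d): one probability per label row of d (row layout is assumed)."""
    scores = [sum(x * w for x, w in zip(e, row)) for row in d]
    return softmax(scores)


def predict_label(e, d):
    """Step S8: index of the label with the maximum probability for token vector e."""
    probs = label_probabilities(e, d)
    return max(range(len(probs)), key=probs.__getitem__)


def sentence_nll(token_vectors, gold_labels, d):
    """Sentence-level negative log-likelihood, assumed to be the sum over tokens."""
    return -sum(
        math.log(label_probabilities(e, d)[y])
        for e, y in zip(token_vectors, gold_labels)
    )
```

In an actual model, b, c and e would come from the three BERT encoders and the loss would be minimized by gradient descent; the sketch only shows the arithmetic of the scoring and the objective.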

Steps S4 to S8 are illustrated in FIG. 2, the corresponding algorithm flowchart.

Step S9: model testing. The optimal model obtained in step S8 is used to predict the test set data and obtain the corresponding predicted labels; the numbers of correct and incorrect detections are counted by comparing the predicted labels with the actual labels, giving the detection precision, the recall and the comprehensive evaluation index F1 value.

In an alternative embodiment of the present invention, the step S9 includes the following sub-steps:

step S901: the test set data is sent to the trained and verified enhanced semantic double-tower model, and for each input sentence sequence S = {c_1, c_2, ..., c_m} the model predicts the corresponding labels y = {y_1, y_2, ..., y_m};

F1 = 2*(Precision*Recall)/(Precision+Recall);

and testing to ensure that the three indexes meet the design standard.
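Counting entities for the indexes of step S9 requires turning each predicted label sequence into entity spans first. The sketch below assumes the common reading of the BIOS scheme (B = beginning of an entity, I = inside, O = outside, S = single-token entity), which the patent does not define explicitly, and it simply ignores stray I- tags that do not continue an open entity; the function name decode_bios is hypothetical.

```python
def decode_bios(tags):
    """Convert a BIOS tag sequence into (start, end, label) spans, end exclusive."""
    entities = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # Close the open entity unless this tag continues it with the same label.
        if start is not None and not (prefix == "I" and name == label):
            entities.append((start, i, label))
            start, label = None, None
        if prefix == "S":
            # A single-token entity is complete on its own.
            entities.append((i, i + 1, name))
        elif prefix == "B":
            start, label = i, name
        # "O" tags and stray "I-" tags open nothing.
    if start is not None:
        entities.append((start, len(tags), label))
    return entities
```

Applying the same decoding to the real labels and to the predicted labels gives two sets of spans whose overlap is the number of correctly predicted entities.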

Step S10: carrying out named entity recognition on the text to be recognized in the bridge engineering database by using the trained and tested enhanced semantic double-tower model.

Although embodiments of the present invention have been disclosed above, the invention is not limited to the applications listed in the description and the embodiments; it can be applied to various fields suited to it, and further modifications will be readily apparent to those skilled in the art. Accordingly, the invention is not limited to the specific details and illustrations shown and described herein, provided there is no departure from the general concepts defined in the claims and their equivalents.

Claims

1. The method for identifying the enhanced semantic named entity oriented to the text in the bridge field is characterized by comprising the following steps of:

2. The method for identifying the enhanced semantic named entity for the text in the bridge field according to claim 1, wherein the step S3 specifically comprises the following sub-steps:

3. The method for identifying the enhanced semantic named entity for the text in the bridge field according to claim 2, wherein the entity types in step S302 include related person names, bridge project related place names, bridge project related industry names, related organization names, dates, compiling basis category names, compiling basis numbers and bridge field professional vocabulary, and the corresponding labels are: PER, LOC, IND, ORG, DATE, CATE, NUM and TER.

4. The method for identifying the enhanced semantic named entity for the text in the bridge domain according to claim 2, wherein the method for identifying the enhanced named entity further comprises the following steps:

5. The method for identifying the enhanced semantic named entity for the text in the bridge field according to claim 4, wherein the step S4 specifically comprises the following sub-steps:

6. The method for identifying the enhanced semantic named entity for the text in the bridge field according to claim 4, wherein the step S5 specifically comprises the following sub-steps:

7. The method for identifying the enhanced semantic named entity for the text in the bridge field according to claim 1, wherein the step S9 specifically comprises the following sub-steps:

F1 = 2*(Precision*Recall)/(Precision+Recall);

and the three indexes are tested to meet the design standard.

8. The method for identifying the enhanced semantic named entity for the text in the bridge field according to claim 1, wherein the set proportion of the corpus text data from step S2 that is selected in step S3 is 5%-10%.