CN113033206A - Bridge detection field text entity identification method based on machine reading understanding


Info

Publication number
CN113033206A
CN113033206A (Application No. CN202110357215.9A)
Authority
CN
China
Prior art keywords
embedding
character
text
word
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110357215.9A
Other languages
Chinese (zh)
Other versions
CN113033206B (en)
Inventor
李韧
莫天金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Jiaotong University
Original Assignee
Chongqing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Jiaotong University
Priority to CN202110357215.9A
Publication of CN113033206A
Application granted
Publication of CN113033206B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a method for recognizing text entities in the field of bridge detection based on machine reading understanding, which comprises the following steps: S1, acquiring a question text and a target text; S2, extracting character embeddings, bigram embeddings and weighted word embeddings from the question text and the target text; S3, concatenating the character embeddings, bigram embeddings and weighted word embeddings to obtain a joint feature representation; S4, inputting the joint feature representation into a neural network to complete entity recognition. Because character embeddings only capture features at the contextual character level, the invention purposefully introduces external dictionary information to enhance the feature representation of the model input, namely a bigram embedding (Bigram Embedding) unit and a weighted word embedding (Weighted Word Embedding) unit trained on a large-scale corpus, so that richer semantic features are extracted and entity recognition is improved.

Description

Bridge detection field text entity identification method based on machine reading understanding
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for recognizing text entities in the field of bridge detection based on machine reading understanding.
Background
For years, natural language processing has been an important research direction and one of the core research tasks in the field of artificial intelligence, and it has advanced greatly with the development of machine learning and deep learning. However, little applied research on intelligent decision support in bridge health management and maintenance has been carried out on the basis of natural language processing technology. During long-term operation, bridges are affected by traffic loads, environmental excitation, emergencies, degradation of structural materials and other internal and external factors, so various defects in structural components are inevitable. Meanwhile, the bridge industry has established an operation-period bridge health management business system consisting of daily inspection and maintenance, frequent inspection, periodic/special inspection, load testing, maintenance and reinforcement, structural health monitoring and the like, and has accumulated massive historical bridge health management data characterized by obvious multi-source heterogeneity and rapidly growing data volume. However, various health management and maintenance records are still stored in relational databases as linked documents; when performing services such as bridge structural condition assessment or management and maintenance decision support, the related documents are still consulted mainly by hand, and massive fine-grained structure and defect information scattered in unstructured text needs to be identified and extracted. Therefore, the problem of intelligent decision support in the field of bridge management and maintenance based on natural language processing technology still needs to be further addressed.
At present, with the rapid development of deep learning, end-to-end deep neural network models have matured and become the main approach and trend for natural language processing problems, overcoming the traditional machine learning models' heavy dependence on feature engineering and their insufficient ability to represent contextual features. Named entity recognition has always been a fundamental research topic in natural language processing; its essence is to extract predefined, valuable information from semi-structured or unstructured text, store it in a semi-structured form, and support intelligent applications such as knowledge graphs and automatic question answering. For the task of named entity recognition in bridge detection texts, only a named entity recognition method for structural conditions and maintenance activities based on a bridge ontology and semi-supervised CRF (Conditional Random Fields) is currently available.
Therefore, related research has not considered how to effectively exploit prior information or how to handle nested entities, and flat (outer-layer) and nested named entity recognition suited to the descriptive characteristics of Chinese bridge detection reports still needs further study.
Disclosure of Invention
In view of the deficiencies in the prior art, the invention discloses a method for recognizing text entities in the bridge detection field based on machine reading understanding. On the basis of character embeddings, the method purposefully introduces external dictionary information to enhance the feature representation of the model input, namely a bigram embedding (Bigram Embedding) unit and a weighted word embedding (Weighted Word Embedding) unit trained on a large-scale corpus, thereby improving entity recognition.
In order to solve the technical problems, the invention adopts the following technical scheme:
a bridge detection field text entity identification method based on machine reading understanding comprises the following steps:
s1, acquiring a question text and a target text;
s2, extracting character embedding, binary character embedding and weighted word embedding from the question text and the target text;
s3, embedding characters, embedding binary characters and embedding and splicing weighted words to obtain joint feature expression;
and S4, inputting the combined feature expression into a neural network to complete entity identification.
Preferably, the method for extracting character embeddings in step S2 comprises:
serializing the question text as Q = [q_1, q_2, ..., q_m], where q_i denotes the i-th character of the question text, and serializing the target text as C = [c_1, c_2, ..., c_n], where c_i denotes the i-th character of the target text;
concatenating Q and C to form X = [x_1, x_2, ..., x_l], where x_i ∈ Q ∪ C and l = m + n;
performing a character-embedding-table lookup to obtain a vector matrix E ∈ R^(l×d) that can be input into the BERT model, whose i-th element is e_i = w_c(x_i), where w_c(x_i) denotes the vector representation of character x_i in the character embedding table w_c and d denotes the dimension of each character vector in the character embedding table;
obtaining the character embeddings from the vector matrix E, where the i-th character embedding is a_i = w_bert(x_i), with w_bert denoting the character embedding table of the BERT model.
Preferably, the method for extracting the weighted word embeddings in step S2 comprises:
constructing the four sets B, M, E, S as follows:
B(x_i) = {w_{i,k} | w_{i,k} ∈ D, i < k ≤ l}
M(x_i) = {w_{j,k} | w_{j,k} ∈ D, 1 ≤ j < i < k ≤ l}
E(x_i) = {w_{j,i} | w_{j,i} ∈ D, 1 ≤ j < i}
S(x_i) = {x_i | x_i ∈ D}
where D denotes the external dictionary and w_{i,k} denotes the subsequence [x_i, x_{i+1}, ..., x_k] of the input sequence X; B(x_i) denotes the subsequences w_{i,k} matched in the external dictionary D in which character x_i is the start character of w_{i,k}; M(x_i) denotes the matched subsequences in which x_i is a middle character; E(x_i) denotes the matched subsequences in which x_i is the end character; S(x_i) denotes the case in which the current character itself is matched in the external dictionary D; if any of the four sets is empty, it is filled with the word NONE;
constructing the weighted word embeddings as follows:
x_i^s = [v_s(B); v_s(M); v_s(E); v_s(S)]
where x_i^s denotes the i-th character in the weighted word embeddings and v_s(B), v_s(M), v_s(E), v_s(S) denote the weighted representations corresponding to B, M, E, S, respectively.
Preferably, the weighted representation v_s(L) of a word set L is calculated as
v_s(L) = (1/Z) Σ_{w∈L} z(w) ω_word(w)
where z(w) denotes the frequency of occurrence of the word w in the external dictionary D, ω_word(w) denotes the word embedding of the word w looked up in the word embedding table, and Z denotes the sum of the word frequencies, Z = Σ_{w∈B∪M∪E∪S} z(w).
Preferably, step S4 comprises:
inputting the joint feature representation into a neural network to extract feature information;
predicting character probabilities and entity spans from the feature information to complete entity recognition.
Preferably, the neural network is a BiLSTM.
In summary, compared with the prior art, the invention has the following technical effects:
(1) Because character embeddings only capture features at the contextual character level, the invention purposefully introduces external dictionary information to enhance the feature representation of the model input, namely a bigram embedding (Bigram Embedding) unit and a weighted word embedding (Weighted Word Embedding) unit trained on a large-scale corpus, so that richer semantic features are extracted and entity recognition is improved.
(2) Because the BERT pre-training model only supports character-level input for Chinese, the initial vector matrix E is used as the input of the BERT pre-training model; the BERT model position-encodes E and, using a multi-head attention mechanism, extracts more accurate context-related semantic information, and with fine-tuning the character-level vector representations output by the BERT pre-training model become better suited to the context of bridge detection report texts.
(3) The method for generating the weighted word embeddings not only introduces the word embeddings from the dictionary well, but also loses no context information, because the matching result can be exactly recovered from the four character sets.
(4) The weighting algorithm does not use a dynamic weighting scheme such as an attention mechanism but uses the static word frequency, which can be computed in advance by counting occurrences; this greatly speeds up computing the weight of each word.
(5) The invention computes the probability of each character being a start or end index as well as the probability of each whole entity span, which has the advantage that flat (outer-layer) entities and nested entities contained in the text can be decoded simultaneously to obtain the final answer.
(6) The invention uses a BiLSTM (Bidirectional Long Short-Term Memory) model for encoding and extracts richer bidirectional contextual feature information.
Drawings
FIG. 1 is a flowchart of an embodiment of the bridge detection field text entity recognition method based on machine reading understanding disclosed by the present invention;
FIG. 2 is an overall architecture diagram of another embodiment of the bridge detection field text entity recognition method based on machine reading understanding;
FIG. 3 is a diagram illustrating an example of the weighted word embedding unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1 and FIG. 2, the invention discloses a method for recognizing text entities in the bridge detection field based on machine reading understanding, comprising the following steps:
S1, acquiring a question text and a target text;
S2, extracting character embeddings, bigram embeddings and weighted word embeddings from the question text and the target text;
S3, concatenating the character embeddings, bigram embeddings and weighted word embeddings to obtain a joint feature representation;
S4, inputting the joint feature representation into a neural network to complete entity recognition.
The question text in the present invention contains prior information. Because character embeddings only capture features at the contextual character level, the invention purposefully introduces external dictionary information to enhance the feature representation of the model input, namely a bigram embedding (Bigram Embedding) unit and a weighted word embedding (Weighted Word Embedding) unit trained on a large-scale corpus, so that richer semantic features are extracted and entity recognition is improved.
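By way of illustration only, the following minimal Python sketch outlines how steps S1 to S4 could fit together; the names recognize_entities, embedder, encoder and predictor are hypothetical stand-ins and are not part of the invention.

```python
import torch

def recognize_entities(question_text, target_text, embedder, encoder, predictor):
    """Minimal sketch of steps S1-S4; all components are assumed stand-ins."""
    # S1: the question text (prior information) and the target text are given.
    # S2: extract the three kinds of embeddings for the concatenated sequence.
    char_emb = embedder.char_embeddings(question_text, target_text)            # (l, d_char)
    bigram_emb = embedder.bigram_embeddings(question_text, target_text)        # (l, d_bi)
    word_emb = embedder.weighted_word_embeddings(question_text, target_text)   # (l, d_word)
    # S3: concatenate the three embeddings into a joint feature representation.
    joint = torch.cat([char_emb, bigram_emb, word_emb], dim=-1)                # (l, d_char+d_bi+d_word)
    # S4: encode with a neural network (e.g. a BiLSTM) and predict entity spans.
    hidden = encoder(joint.unsqueeze(0))                                       # (1, l, 2*h)
    return predictor(hidden)                                                   # start/end/span scores
```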
In specific implementation, the method for extracting character embeddings in step S2 includes:
serializing the question text as Q = [q_1, q_2, ..., q_m], where q_i denotes the i-th character of the question text, and serializing the target text as C = [c_1, c_2, ..., c_n], where c_i denotes the i-th character of the target text;
concatenating Q and C to form X = [x_1, x_2, ..., x_l], where x_i ∈ Q ∪ C and l = m + n;
performing a character-embedding-table lookup to obtain a vector matrix E ∈ R^(l×d) that can be input into the BERT model, whose i-th element is e_i = w_c(x_i), where w_c(x_i) denotes the vector representation of character x_i in the character embedding table w_c and d denotes the dimension of each character vector in the character embedding table;
obtaining the character embeddings from the vector matrix E, where the i-th character embedding is a_i = w_bert(x_i), with w_bert denoting the character embedding table of the BERT model.
Because the BERT pre-training model only supports character-level input for Chinese, the obtained initial vector matrix E is used as the input of the BERT pre-training model. The BERT model position-encodes E and, using a multi-head attention mechanism, extracts more accurate context-related semantic information; with fine-tuning, the character-level vector representations output by the BERT pre-training model become better suited to the context of bridge detection report texts.
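By way of illustration only, a minimal Python sketch of the character-embedding step follows, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint as stand-ins for the BERT pre-training model described above; it serializes Q and C, concatenates them into X and takes the BERT output as the contextual character-level representation.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # example checkpoint, an assumption
bert = BertModel.from_pretrained("bert-base-chinese")

def char_embeddings(question_text, target_text):
    # Serialize Q = [q_1..q_m] and C = [c_1..c_n] at the character level and
    # concatenate them into X = [x_1..x_l] with l = m + n.
    x = list(question_text) + list(target_text)
    inputs = tokenizer(x, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Contextual character-level vectors (assuming each character maps to one token;
    # the [CLS] and [SEP] positions are included in the output length).
    return outputs.last_hidden_state  # shape (1, l + 2, 768)
```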
The introduction of bigram embeddings copes well with the problem of characterizing different entities composed of the same characters. In the bridge detection field there are many entity expressions composed of two characters, such as "bridge pier", "bridge abutment", "abutment cap" and "abutment body", and it is easy to see that most two-character entities share common characters, such as "bridge", "pier" and "abutment"; although the characters are the same, the semantic information they express differs across entities. Therefore, the input characters are converted into bigram embedding expressions to enhance the semantic representation of the input data for both entities and non-entities, where w_b denotes the bigram embedding table and the i-th bigram embedding is obtained by looking up the character pair (x_i, x_{i+1}) in w_b.
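By way of illustration only, a minimal Python sketch of a bigram-embedding lookup follows; the dictionary-based table w_b, the padding symbol for the last position and the zero-vector fallback for unseen bigrams are assumptions, not details specified by the invention.

```python
import numpy as np

def bigram_embeddings(x, w_b, dim=50):
    """x: list of characters; w_b: dict mapping a two-character string to a vector of size dim."""
    vectors = []
    for i in range(len(x)):
        # The i-th bigram pairs x_i with its right neighbour; the last position has
        # no neighbour, so a padding symbol is used (an assumption of this sketch).
        pair = x[i] + (x[i + 1] if i + 1 < len(x) else "<pad>")
        vectors.append(w_b.get(pair, np.zeros(dim)))  # zero vector for unseen bigrams
    return np.stack(vectors)  # shape (l, dim)
```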
As shown in FIG. 3, in specific implementation, the method for extracting the weighted word embeddings in step S2 includes:
constructing the four sets B, M, E, S as follows:
B(x_i) = {w_{i,k} | w_{i,k} ∈ D, i < k ≤ l}
M(x_i) = {w_{j,k} | w_{j,k} ∈ D, 1 ≤ j < i < k ≤ l}
E(x_i) = {w_{j,i} | w_{j,i} ∈ D, 1 ≤ j < i}
S(x_i) = {x_i | x_i ∈ D}
where D denotes the external dictionary and w_{i,k} denotes the subsequence [x_i, x_{i+1}, ..., x_k] of the input sequence X; B(x_i) denotes the subsequences w_{i,k} matched in the external dictionary D in which character x_i is the start character of w_{i,k}; M(x_i) denotes the matched subsequences in which x_i is a middle character; E(x_i) denotes the matched subsequences in which x_i is the end character; S(x_i) denotes the case in which the current character itself is matched in the external dictionary D; if any of the four sets is empty, it is filled with the word NONE;
constructing the weighted word embeddings as follows:
x_i^s = [v_s(B); v_s(M); v_s(E); v_s(S)]
where x_i^s denotes the i-th character in the weighted word embeddings and v_s(B), v_s(M), v_s(E), v_s(S) denote the weighted representations corresponding to B, M, E, S, respectively.
This method of generating the weighted word embeddings not only introduces the word embeddings from the dictionary well, but also loses no context information, because the matching result can be exactly recovered from the four character sets.
In specific implementation, the weighted representation v_s(L) of a word set L is calculated by the following formula:
v_s(L) = (1/Z) Σ_{w∈L} z(w) ω_word(w)
where z(w) denotes the frequency of occurrence of the word w in the external dictionary D, ω_word(w) denotes the word embedding of the word w looked up in the word embedding table, and Z denotes the sum of the word frequencies, Z = Σ_{w∈B∪M∪E∪S} z(w).
The word set L is regenerated from the original dictionary; the biggest difference from the original dictionary is that the word embeddings are weighted by the above formula.
The weighting algorithm does not use a dynamic weighting scheme such as an attention mechanism but uses the static word frequency, which can be computed in advance by counting occurrences; this greatly speeds up computing the weight of each word.
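By way of illustration only, a minimal Python sketch of the weighted word embedding construction follows; it assumes the external dictionary with precomputed word frequencies z(w) and the word embedding table ω_word are given as Python dicts sharing the same keys, enumerates subsequences up to a maximum length for matching (a trie would normally be used), and falls back to zero vectors when a set is empty (standing in for the word NONE).

```python
import numpy as np

def weighted_word_embeddings(x, freq, word_emb, dim=50, max_len=5):
    """x: character list; freq: dict word -> frequency z(w); word_emb: dict word -> vector of size dim."""
    l = len(x)
    # Collect, for every character position, the matched words in which it is the
    # Begin / Middle / End character, or which it forms on its own (Single).
    B = [set() for _ in range(l)]; M = [set() for _ in range(l)]
    E = [set() for _ in range(l)]; S = [set() for _ in range(l)]
    for i in range(l):
        for k in range(i, min(l, i + max_len)):
            w = "".join(x[i:k + 1])
            if w not in freq:
                continue
            if i == k:
                S[i].add(w)
            else:
                B[i].add(w); E[k].add(w)
                for j in range(i + 1, k):
                    M[j].add(w)

    def weighted(word_set, Z):
        # v_s(L) = (1/Z) * sum_{w in L} z(w) * word_emb(w); empty sets fall back to zeros ("NONE").
        if not word_set or Z == 0:
            return np.zeros(dim)
        return sum(freq[w] * np.asarray(word_emb[w]) for w in word_set) / Z

    rows = []
    for i in range(l):
        Z = sum(freq[w] for w in B[i] | M[i] | E[i] | S[i])
        rows.append(np.concatenate([weighted(B[i], Z), weighted(M[i], Z),
                                    weighted(E[i], Z), weighted(S[i], Z)]))
    return np.stack(rows)  # shape (l, 4 * dim)
```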
In specific implementation, step S4 includes:
inputting the joint feature representation into a neural network to extract feature information;
predicting character probabilities and entity spans from the feature information to complete entity recognition.
In the invention, the probability of each character being a start or end index and the probability of each whole entity span are calculated, which has the advantage that flat (outer-layer) entities and nested entities contained in the text can be decoded simultaneously to obtain the final answer.
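By way of illustration only, a minimal PyTorch sketch of this start/end/span prediction follows; the layer shapes, the sigmoid scoring and the class name SpanPredictor are assumptions rather than details specified by the invention.

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    """Predicts start/end probabilities per character and a probability for each candidate span."""
    def __init__(self, hidden_size):
        super().__init__()
        self.start_fc = nn.Linear(hidden_size, 1)
        self.end_fc = nn.Linear(hidden_size, 1)
        self.span_fc = nn.Linear(2 * hidden_size, 1)

    def forward(self, h):                                        # h: (batch, l, hidden_size)
        p_start = torch.sigmoid(self.start_fc(h)).squeeze(-1)    # (batch, l)
        p_end = torch.sigmoid(self.end_fc(h)).squeeze(-1)        # (batch, l)
        l = h.size(1)
        # Pair every candidate start i with every candidate end j and score the span,
        # which allows flat and nested entities to be decoded simultaneously.
        hi = h.unsqueeze(2).expand(-1, l, l, -1)                  # start representations
        hj = h.unsqueeze(1).expand(-1, l, l, -1)                  # end representations
        p_span = torch.sigmoid(self.span_fc(torch.cat([hi, hj], dim=-1))).squeeze(-1)  # (batch, l, l)
        return p_start, p_end, p_span
```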
In specific implementation, the neural network is a BiLSTM.
The invention uses a BiLSTM (Bidirectional Long Short-Term Memory) model for encoding and extracts richer bidirectional contextual feature information.
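By way of illustration only, a minimal PyTorch sketch of BiLSTM encoding of the joint feature representation follows; the input and hidden sizes are placeholder values, not values specified by the invention.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encodes the joint feature representation bidirectionally."""
    def __init__(self, input_size=1018, hidden_size=256):
        # placeholder input_size: e.g. 768 (BERT) + 50 (bigram) + 4*50 (weighted word embedding)
        super().__init__()
        self.bilstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, joint_features):             # (batch, l, input_size)
        output, _ = self.bilstm(joint_features)    # (batch, l, 2 * hidden_size)
        return output

# Example usage together with the (hypothetical) SpanPredictor sketched above:
# encoder = BiLSTMEncoder(); predictor = SpanPredictor(hidden_size=512)
# p_start, p_end, p_span = predictor(encoder(joint))
```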
Finally, it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that, while the invention has been described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A bridge detection field text entity identification method based on machine reading understanding, characterized by comprising the following steps:
S1, acquiring a question text and a target text;
S2, extracting character embeddings, bigram embeddings and weighted word embeddings from the question text and the target text;
S3, concatenating the character embeddings, bigram embeddings and weighted word embeddings to obtain a joint feature representation;
S4, inputting the joint feature representation into a neural network to complete entity recognition.
2. The bridge detection field text entity recognition method based on machine reading understanding of claim 1, wherein the method for extracting character embeddings in step S2 comprises:
serializing the question text as Q = [q_1, q_2, ..., q_m], where q_i denotes the i-th character of the question text, and serializing the target text as C = [c_1, c_2, ..., c_n], where c_i denotes the i-th character of the target text;
concatenating Q and C to form X = [x_1, x_2, ..., x_l], where x_i ∈ Q ∪ C and l = m + n;
performing a character-embedding-table lookup to obtain a vector matrix E ∈ R^(l×d) that can be input into the BERT model, whose i-th element is e_i = w_c(x_i), where w_c(x_i) denotes the vector representation of character x_i in the character embedding table w_c and d denotes the dimension of each character vector in the character embedding table;
obtaining the character embeddings from the vector matrix E, where the i-th character embedding is a_i = w_bert(x_i), with w_bert denoting the character embedding table of the BERT model.
3. The bridge detection field text entity recognition method based on machine reading understanding of claim 2, wherein the method for extracting the weighted word embeddings in step S2 comprises:
constructing the four sets B, M, E, S as follows:
B(x_i) = {w_{i,k} | w_{i,k} ∈ D, i < k ≤ l}
M(x_i) = {w_{j,k} | w_{j,k} ∈ D, 1 ≤ j < i < k ≤ l}
E(x_i) = {w_{j,i} | w_{j,i} ∈ D, 1 ≤ j < i}
S(x_i) = {x_i | x_i ∈ D}
where D denotes the external dictionary and w_{i,k} denotes the subsequence [x_i, x_{i+1}, ..., x_k] of the input sequence X; B(x_i) denotes the subsequences w_{i,k} matched in the external dictionary D in which character x_i is the start character of w_{i,k}; M(x_i) denotes the matched subsequences in which x_i is a middle character; E(x_i) denotes the matched subsequences in which x_i is the end character; S(x_i) denotes the case in which the current character itself is matched in the external dictionary D; if any of the four sets is empty, it is filled with the word NONE;
constructing the weighted word embeddings as follows:
x_i^s = [v_s(B); v_s(M); v_s(E); v_s(S)]
where x_i^s denotes the i-th character in the weighted word embeddings and v_s(B), v_s(M), v_s(E), v_s(S) denote the weighted representations corresponding to B, M, E, S, respectively.
4. The bridge detection field text entity recognition method based on machine reading understanding of claim 3, wherein the weighted representation v_s(L) of a word set L is calculated as follows:
v_s(L) = (1/Z) Σ_{w∈L} z(w) ω_word(w)
where z(w) denotes the frequency of occurrence of the word w in the external dictionary D, ω_word(w) denotes the word embedding of the word w looked up in the word embedding table, and Z denotes the sum of the word frequencies, Z = Σ_{w∈B∪M∪E∪S} z(w).
5. The bridge detection field text entity recognition method based on machine reading understanding of any one of claims 1 to 4, wherein step S4 comprises:
inputting the joint feature representation into a neural network to extract feature information;
predicting character probabilities and entity spans from the feature information to complete entity recognition.
6. The bridge detection field text entity recognition method based on machine reading understanding of claim 5, wherein the neural network is a BiLSTM.
CN202110357215.9A 2021-04-01 2021-04-01 Bridge detection field text entity identification method based on machine reading understanding Active CN113033206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110357215.9A CN113033206B (en) 2021-04-01 2021-04-01 Bridge detection field text entity identification method based on machine reading understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110357215.9A CN113033206B (en) 2021-04-01 2021-04-01 Bridge detection field text entity identification method based on machine reading understanding

Publications (2)

Publication Number Publication Date
CN113033206A true CN113033206A (en) 2021-06-25
CN113033206B CN113033206B (en) 2022-04-22

Family

ID=76453874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110357215.9A Active CN113033206B (en) 2021-04-01 2021-04-01 Bridge detection field text entity identification method based on machine reading understanding

Country Status (1)

Country Link
CN (1) CN113033206B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200342056A1 (en) * 2019-04-26 2020-10-29 Tencent America LLC Method and apparatus for natural language processing of medical text in chinese
CN110532303A (en) * 2019-09-04 2019-12-03 重庆交通大学 A kind of information retrieval and the potential relationship method of excavation for Bridge Management & Maintenance information
CN111178074A (en) * 2019-12-12 2020-05-19 天津大学 Deep learning-based Chinese named entity recognition method
CN111160031A (en) * 2019-12-13 2020-05-15 华南理工大学 Social media named entity identification method based on affix perception
CN111091000A (en) * 2019-12-24 2020-05-01 深圳视界信息技术有限公司 Processing system and method for extracting user fine-grained typical opinion data
CN112101027A (en) * 2020-07-24 2020-12-18 昆明理工大学 Chinese named entity recognition method based on reading understanding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN GONG: "Hierarchical LSTM with char-subword-word tree-structure representation for Chinese named entity recognition", Science China *
张海楠 et al.: "Chinese named entity recognition based on deep neural networks", 《中文信息学报》 (Journal of Chinese Information Processing) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935324A (en) * 2021-09-13 2022-01-14 昆明理工大学 Cross-border national culture entity identification method and device based on word set feature weighting
CN113935324B (en) * 2021-09-13 2022-10-28 昆明理工大学 Cross-border national culture entity identification method and device based on word set feature weighting
CN115879474A (en) * 2023-02-14 2023-03-31 华东交通大学 Fault nested named entity identification method based on machine reading understanding

Also Published As

Publication number Publication date
CN113033206B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN114064918B (en) Multi-modal event knowledge graph construction method
CN111737496A (en) Power equipment fault knowledge map construction method
CN113312501A (en) Construction method and device of safety knowledge self-service query system based on knowledge graph
CN113033206B (en) Bridge detection field text entity identification method based on machine reading understanding
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111143553B (en) Method and system for identifying specific information of real-time text data stream
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
CN113051929A (en) Entity relationship extraction method based on fine-grained semantic information enhancement
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN111309918A (en) Multi-label text classification method based on label relevance
CN114491024B (en) Specific field multi-label text classification method based on small sample
CN114153973A (en) Mongolian multi-mode emotion analysis method based on T-M BERT pre-training model
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN111460097B (en) TPN-based small sample text classification method
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN113505222A (en) Government affair text classification method and system based on text circulation neural network
CN116522165A (en) Public opinion text matching system and method based on twin structure
Hua et al. A character-level method for text classification
CN113792144B (en) Text classification method of graph convolution neural network based on semi-supervision
CN115795060A (en) Entity alignment method based on knowledge enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant