CN112417888A

CN112417888A - Method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm

Info

Publication number: CN112417888A
Application number: CN202011337268.6A
Authority: CN
Inventors: 陆佃龙; 王增林
Original assignee: Jiangsu Netspectrum Data Technology Co ltd
Current assignee: Jiangsu Netspectrum Data Technology Co ltd
Priority date: 2020-11-26
Filing date: 2020-11-26
Publication date: 2021-02-26

Abstract

The invention provides a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm. Text data of an emerging industry is obtained through a web crawler and labeled in a semi-supervised manner; the labeled text data is preprocessed to construct a training data set and a validation data set; a BiLSTM-CRF model and an R-BERT model are trained on the training and validation data sets; the entities contained in the text data to be predicted are extracted by the trained BiLSTM-CRF model; the relationships between the entities in the text data to be predicted are predicted by the trained R-BERT model, and relation links are established between related entities; and the semantic relation triples of the text data are extracted according to the established links, completing the semantic analysis of the text data. Aimed at information extraction from unstructured text, the invention provides a high-precision semantic relation extraction method that extracts the required entities from the text with a BiLSTM-CRF model and predicts the relationships between the entities from the text and the extracted entities with an R-BERT model.

Description

Method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm

Technical Field

The invention relates to the field of computer technology, in particular to a semantic relation analysis method based on deep learning and pre-trained models and to an efficient semi-supervised method for constructing semantic entity relation labels, and more particularly to a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm.

Background

With the rapid development of the Internet, analyzing and processing massive Internet data has become a vital task in every industry. The massive text data on the Internet, especially unstructured text, contains a large amount of important information but also a large amount of noise and irrelevant information, and the effective entities in it are sparsely distributed. Extracting information from unstructured text with high precision has therefore become important for industry analysis.

Manual relation extraction requires a large amount of human effort to mine the relationships between entities in massive data, and traditional machine learning struggles to reach high precision in semantic analysis. In addition, the shortage of labeled data and the high cost of labeling limit the precision of semantic entity relation extraction.

Market data in unstructured Internet articles is sparsely distributed, and irrelevant information introduces considerable noise when judging the relationship between a market data value and the product, time and other items it refers to. When entity semantic analysis is applied to articles on emerging industries (such as smart home, the Internet of Things and 5G) to extract key industry data, training data tied directly to those industries is hard to construct because few articles exist. How to use data sets built from articles on traditional industries for data enhancement and migration, and thereby solve the "cold start" problem of data extraction in emerging industries, becomes the key to semantic entity relation analysis.

Disclosure of Invention

In view of the above disadvantages of the prior art, the present invention provides a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, so as to solve the problems in the prior art.

To achieve the above and other related objects, the present invention provides a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, comprising the following steps:

acquiring text data of emerging industries through a web crawler, and performing semi-supervised labeling on the text data;

preprocessing the labeled text data, and constructing a training data set and a validation data set; training a BiLSTM-CRF model and an R-BERT model on the training data set and the validation data set;

extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF model;

predicting the relationships between the entities in the text data to be predicted through the trained R-BERT model, and establishing relation links between related entities;

and extracting the semantic relation triples of the text data to be predicted according to the established relation links, completing the semantic analysis of the text data.

Optionally, performing semi-supervised labeling on the text data includes:

training a model on the part of the text data that has been labeled, in an incremental learning manner, and predicting the remaining unlabeled text data with the trained model;

and directly using prediction results whose confidence is higher than a preset threshold as the labels of the text data, and manually labeling the text data whose confidence is lower than the preset threshold.
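The confidence-based routing described above can be sketched as follows. This is only an illustrative sketch: the function name `route_predictions`, the tuple layout of a prediction, and the handling of a confidence exactly equal to the threshold (sent to manual review here) are assumptions, not details given in the patent.

```python
from typing import List, Tuple

# Hypothetical record layout: (text, predicted_label, confidence)
Prediction = Tuple[str, str, float]


def route_predictions(
    predictions: List[Prediction], threshold: float
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split model predictions into auto-accepted labels and texts for manual labeling."""
    auto_labeled: List[Tuple[str, str]] = []
    needs_manual_label: List[str] = []
    for text, label, confidence in predictions:
        if confidence > threshold:
            # High confidence: the prediction is used directly as the label.
            auto_labeled.append((text, label))
        else:
            # Low confidence (or equal, by assumption): hand over to an annotator.
            needs_manual_label.append(text)
    return auto_labeled, needs_manual_label
```

In an incremental loop, the auto-accepted and manually corrected samples would be added to the labeled pool before the next round of training.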

Optionally, extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF model includes:

modeling the text data sequence in the forward and backward directions with a bidirectional LSTM model;

and scoring the whole prediction path with a conditional random field (CRF) that constrains the relationships between the label results, and extracting the entities contained in the text data.
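Scoring a whole prediction path is commonly done with Viterbi decoding over per-token emission scores and tag-to-tag transition scores. The sketch below assumes that the emission scores come from the bidirectional LSTM and the transition scores from the CRF layer; the patent does not give the decoding procedure, so the function `viterbi_decode` and its inputs are illustrative only.

```python
from typing import List


def viterbi_decode(
    emissions: List[List[float]], transitions: List[List[float]]
) -> List[int]:
    """Return the highest-scoring tag path.

    emissions[t][j]: score of tag j at position t (assumed BiLSTM output).
    transitions[i][j]: score of moving from tag i to tag j (assumed CRF parameters).
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at position t.
            best_prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

A strongly negative transition score, for example from an outside tag directly to an inside tag, is how the CRF keeps invalid label sequences out of the decoded path.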

Optionally, the method further comprises: acquiring the entities extracted from the text data, and constructing an entity library from them;

based on the entity library, randomly replacing entities in the traditional-industry data with entities of the same type, and constructing random numeral generators with different expression styles for the general entity types;

and using the random numeral generators to randomly generate replacements for the general entities in the original labeled text data, thereby expanding the labeled data.
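The replacement-based augmentation can be illustrated with a small sketch. The entity type names (`COMPANY`, `PRODUCT`, `AMOUNT`, `TIME`), the surface templates and the segment representation are all hypothetical choices made for this example; the patent only states that same-type entities are swapped in from the library and that time and amount expressions are randomly generated.

```python
import random
from typing import Callable, Dict, List, Tuple

# A labeled sentence as (text, entity_type) segments; "O" marks non-entity text.
Segment = Tuple[str, str]


def random_amount(rng: random.Random) -> str:
    """Generate an amount expression in one of several hypothetical styles."""
    value = rng.randint(1, 999)
    template = rng.choice(["{} million yuan", "RMB {} billion", "USD {} million"])
    return template.format(value)


def random_time(rng: random.Random) -> str:
    """Generate a time expression in one of several hypothetical styles."""
    year = rng.randint(2015, 2025)
    quarter = rng.randint(1, 4)
    return rng.choice([f"{year}", f"Q{quarter} {year}", f"the first half of {year}"])


# General entity types are regenerated rather than looked up in the library.
GENERATORS: Dict[str, Callable[[random.Random], str]] = {
    "AMOUNT": random_amount,
    "TIME": random_time,
}


def augment(
    segments: List[Segment], entity_library: Dict[str, List[str]], rng: random.Random
) -> List[Segment]:
    """Create a new labeled sample by swapping entities while keeping their types."""
    augmented: List[Segment] = []
    for text, entity_type in segments:
        if entity_type in GENERATORS:
            augmented.append((GENERATORS[entity_type](rng), entity_type))
        elif entity_type in entity_library:
            augmented.append((rng.choice(entity_library[entity_type]), entity_type))
        else:
            augmented.append((text, entity_type))
    return augmented
```

Because the entity types are preserved, the labels of the original traditional-industry sample carry over to the generated emerging-industry sample unchanged.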

Optionally, if the text data is an article, predicting the relationships between the entities in the text data to be predicted through the trained R-BERT model includes:

obtaining an original BERT model, in which the [CLS] token represents the overall type feature of a sentence in the article and [SEP] separates the multiple sentences of the input article;

combining the input of BERT with the entities extracted upstream, and encoding it with the structure {[CLS] article sentence [SEP] subject [object] [SEP]};

concatenating the sentence vector, the subject entity vector and the object entity vector, and predicting the relation type through a fully connected layer and softmax; where H_s = [h_{s1}, h_{s1+1}, ..., h_{s2}] denotes the subject entity vector and H_o = [h_{o1}, h_{o1+1}, ..., h_{o2}] denotes the object entity vector.
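As a rough illustration of the input layout, the sketch below assembles the token sequence and records the index ranges (s1, s2) and (o1, o2) from which H_s and H_o would be read. The delimiter between subject and object is not clear from the structure as written, so placing a [SEP] token between them is an assumption of this sketch, as is the helper name `build_relation_input`.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def build_relation_input(
    sentence_tokens: List[str], subject_tokens: List[str], object_tokens: List[str]
) -> Tuple[List[str], Span, Span]:
    """Assemble [CLS] sentence [SEP] subject [SEP] object [SEP] and the entity spans."""
    tokens = ["[CLS]"] + list(sentence_tokens) + ["[SEP]"]
    s1 = len(tokens)
    tokens += subject_tokens
    s2 = len(tokens) - 1
    tokens.append("[SEP]")  # assumed separator between subject and object
    o1 = len(tokens)
    tokens += object_tokens
    o2 = len(tokens) - 1
    tokens.append("[SEP]")
    return tokens, (s1, s2), (o1, o2)
```

The hidden states at positions s1 through s2 would form H_s, and those at o1 through o2 would form H_o.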

As described above, the invention provides a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, which has the following beneficial effects:

Text data of emerging industries is acquired through a web crawler and labeled in a semi-supervised manner; the labeled text data is preprocessed to construct a training data set and a validation data set; a BiLSTM-CRF model and an R-BERT model are trained on the training and validation data sets; the entities contained in the text data to be predicted are extracted by the trained BiLSTM-CRF model; the relationships between the entities in the text data to be predicted are predicted by the trained R-BERT model, and relation links are established between related entities; and the semantic relation triples of the text data are extracted according to the established links, completing the semantic analysis of the text data. Addressing the problem of semantic relation analysis and extraction for unstructured documents, the invention discloses a semi-supervised process for labeling text data entities and relations and an automatic information extraction method based on deep learning natural language processing. For unstructured information extraction from articles, the invention provides a high-precision semantic relation extraction method based on deep learning natural language processing, which extracts the required entities from the text with a BiLSTM-CRF model and predicts the relationships between the entities from the text and the extracted entities with an R-BERT model.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic diagram of a framework according to an embodiment;

FIG. 3 is a schematic modeling flow diagram provided by an embodiment.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

Referring to fig. 1 to 3, the present invention provides a method for analyzing sparse semantic relationship by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, which includes the following steps:

S100, acquiring text data of emerging industries through a web crawler, and performing semi-supervised labeling on the text data;

S200, preprocessing the labeled text data, and constructing a training data set and a validation data set; training a BiLSTM-CRF model and an R-BERT model on the training data set and the validation data set;

S300, extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF model;

S400, predicting the relationships between the entities in the text data to be predicted through the trained R-BERT model, and establishing relation links between related entities;

S500, extracting the semantic relation triples of the text data to be predicted according to the established relation links, completing the semantic analysis of the text data.
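Steps S300 to S500 can be pictured as pairing the extracted entities, asking the relation model about each ordered pair, and keeping the pairs for which a relation is predicted. In the sketch below, `predict_relation` is a stand-in for the trained R-BERT model and `entities` for the BiLSTM-CRF output; the "NONE" label for unrelated pairs and the exhaustive pairing strategy are assumptions of this example.

```python
from itertools import permutations
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def extract_triples(
    text: str,
    entities: List[str],
    predict_relation: Callable[[str, str, str], str],
    no_relation: str = "NONE",
) -> List[Triple]:
    """Build semantic relation triples from every ordered entity pair in the text."""
    triples: List[Triple] = []
    for subject, obj in permutations(entities, 2):
        relation = predict_relation(text, subject, obj)
        if relation != no_relation:
            # A predicted relation establishes a link between the two entities.
            triples.append((subject, relation, obj))
    return triples
```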

The method thus acquires text data of emerging industries through a web crawler and labels it in a semi-supervised manner; preprocesses the labeled text data to construct a training data set and a validation data set; trains a BiLSTM-CRF model and an R-BERT model on the training and validation data sets; extracts the entities contained in the text data to be predicted with the trained BiLSTM-CRF model; predicts the relationships between the entities in the text data to be predicted with the trained R-BERT model and establishes relation links between related entities; and extracts the semantic relation triples of the text data according to the established links, completing the semantic analysis of the text data. Addressing the problem of semantic relation analysis and extraction for unstructured documents, the invention discloses a semi-supervised process for labeling text data entities and relations and an automatic information extraction method based on deep learning natural language processing. For unstructured information extraction from articles, the invention provides a high-precision semantic relation extraction method based on deep learning natural language processing, which extracts the required entities from the text with a BiLSTM-CRF model and predicts the relationships between the entities from the text and the extracted entities with an R-BERT model.

According to the above description, in an exemplary embodiment, the semi-supervised labeling of the text data includes: training a model on the part of the text data that has been labeled, in an incremental learning manner, and predicting the remaining unlabeled text data with the trained model; and directly using prediction results whose confidence is higher than a preset threshold as the labels of the text data, while manually labeling the text data whose confidence is lower than the preset threshold.

The invention provides a semi-supervised process for labeling text data entities and relations. For entity semantic relation analysis of market text in new fields and emerging industries on the Internet, it proposes a training data set labeling method that combines remote supervision with incremental learning on the basis of data migration and enhancement. Specifically, because market analysis articles on emerging industries are scarce and few text sentences contain the required data, the market analysis data set of traditional industries is enhanced by constructing an emerging-industry entity library. Entity libraries of products, companies and the like in emerging industries are constructed; in the traditional-industry data, entities are randomly replaced with entities of the same type from the libraries; and for general entity types such as time and amount, random numeral generators with different expression styles are constructed, and the time and amount expressions in the original labeled text data are replaced with randomly generated ones. This alleviates the data shortage in emerging industries and automatically expands the labeled data set substantially, while allowing the model to better learn the linguistic knowledge of the context; compared with common methods, the robustness and accuracy of the model are improved.

At the same time, in an incremental learning manner, a model trained on part of the labeled data predicts another batch of unlabeled text data; prediction results whose confidence is higher than the threshold D_threshold are used directly as labels, and data with low confidence is re-labeled manually. The invention provides a migration method for text entity semantic relation analysis that can be migrated across data analysis in various industries, makes effective use of historical labeled data, and greatly reduces manual labeling for new tasks, especially industry data analysis in emerging fields, thereby effectively addressing the labeling task in emerging industries where data is scarce.

Unlike a general fully manual labeling process, the method adopts a semi-supervised data labeling process that combines remote supervision with incremental learning: a small amount of data is labeled manually, the labeled semantic relations are added to a remote knowledge base, and semantic entity relations in new data that already appear in the knowledge base are labeled directly by remote supervision. The labeled text data then undergoes entity replacement using the constructed entity library and the random time and amount generators, which enhances the data set. The method greatly improves the efficiency of data labeling and reduces labor cost.
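The remote supervision step can be sketched as a lookup of known entity pairs in new sentences. The knowledge base layout (a mapping from an entity pair to a relation) and the plain substring match are simplifying assumptions made for this example; the patent does not specify how the knowledge base is stored or matched.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: (subject, object) -> relation, filled from manual labels.
KnowledgeBase = Dict[Tuple[str, str], str]
LabeledSample = Tuple[str, str, str, str]  # (sentence, subject, relation, object)


def distant_label(sentences: List[str], knowledge_base: KnowledgeBase) -> List[LabeledSample]:
    """Label a sentence with a known relation when both of its entities appear in it."""
    labeled: List[LabeledSample] = []
    for sentence in sentences:
        for (subject, obj), relation in knowledge_base.items():
            if subject in sentence and obj in sentence:
                labeled.append((sentence, subject, relation, obj))
    return labeled
```

Sentences that match no knowledge base entry stay unlabeled and can be passed on to the incremental learning loop.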

According to the above description, in an exemplary embodiment, extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF model includes: modeling the text data sequence in the forward and backward directions with a bidirectional LSTM model; and scoring the whole prediction path with a conditional random field (CRF) that constrains the relationships between the label results, and extracting the entities contained in the text data. In this embodiment, named entity recognition is performed on the labeled articles with a deep learning sequence labeling model, so that the entities present in each article are recognized: a bidirectional LSTM models the sequence forward and backward, and a CRF constrains the relationships between the label results and scores the whole prediction path, so as to extract the entities contained in the sentences of the article.
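Once the sequence labeling model has assigned a tag to every token, the tags have to be grouped into entities. The sketch below assumes a BIO tagging scheme (B-TYPE opens an entity, I-TYPE continues it, O is outside), which the patent does not specify; the function `bio_to_entities` is illustrative.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag ends the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities
```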

According to the above description, in an exemplary embodiment, if the text data is an article, predicting the relationships between the entities in the text data to be predicted through the trained R-BERT model includes: obtaining an original BERT model, in which the [CLS] token represents the overall type feature of a sentence in the article and [SEP] separates the multiple sentences of the input article; combining the input of BERT with the entities extracted upstream and encoding it with the structure {[CLS] article sentence [SEP] subject [object] [SEP]}; and concatenating the sentence vector, the subject entity vector and the object entity vector and predicting the relation type through a fully connected layer and softmax, where H_s = [h_{s1}, h_{s1+1}, ..., h_{s2}] denotes the subject entity vector and H_o = [h_{o1}, h_{o1+1}, ..., h_{o2}] denotes the object entity vector.

The method trains an entity recognition model and a relation prediction model separately on the labeled data. A BiLSTM-CRF model is constructed and trained for entity recognition: a bidirectional LSTM models the sequence forward and backward, and a conditional random field (CRF) constrains the relationships between the label results. An R-BERT model is constructed as the relation analysis model: taking the entities extracted in the previous step as semantic centers, the text is pruned along its dependency relations to remove the unnecessary text noise caused by sparse entity distribution, the forward input of the model is processed into the structure {[CLS] article sentence [SEP] subject [object] [SEP]}, and the R-BERT model is trained on it. For the data to be predicted, the entity recognition model BiLSTM-CRF first predicts the entities appearing in the text; the entities extracted in that step are then combined with one another, integrated with the original text in the format of the training data, and input into the model to predict the relationships between the entities.

The invention provides a method for analyzing the semantic relationships between entities using R-BERT. BERT uses a multi-layer self-attention mechanism to produce a bidirectional encoded representation of the text, extracting semantic and syntactic information at different levels of the text from low to high. The model is pre-trained on massive text data: under the Masked Language Model objective, 15% of the words in the text are masked and the masked words are predicted, which trains the language model. Transfer learning with BERT generally provides strong support for downstream tasks.

The original BERT model uses the [CLS] token to represent the overall type feature of a sentence and [SEP] to separate multiple input sentences. Building on this special input structure of BERT, the R-BERT structure is proposed for semantic relation analysis: the input of BERT is combined with the entities extracted upstream and encoded with the structure {[CLS] article sentence [SEP] subject [object] [SEP]}. H_s = [h_{s1}, h_{s1+1}, ..., h_{s2}] denotes the vector of the subject entity, and H_o = [h_{o1}, h_{o1+1}, ..., h_{o2}] denotes the vector of the object entity. Finally, the sentence vector, the subject entity vector and the object entity vector are concatenated, and the relation type is predicted through a fully connected layer and softmax:

h' = W [concat(H, H_s, H_o)] + b;

p = softmax(h').
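The two formulas above can be traced with a small numeric sketch. H_s and H_o are sequences of token vectors, and the patent does not state how each is reduced to a single vector before concatenation; averaging the token vectors is an assumption of this sketch. The weights W and bias b are placeholders for trained parameters.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a sequence of token vectors (assumed pooling for H_s and H_o)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relation_probabilities(
    H: Vector, H_s: List[Vector], H_o: List[Vector], W: List[Vector], b: Vector
) -> Vector:
    """Compute p = softmax(W [concat(H, H_s, H_o)] + b)."""
    features = H + mean_vector(H_s) + mean_vector(H_o)  # concatenation
    logits = [
        sum(w * x for w, x in zip(row, features)) + bias for row, bias in zip(W, b)
    ]
    return softmax(logits)
```

Each row of W corresponds to one relation type, so the index of the largest entry of p is the predicted relation.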

To address the sparse distribution of entities in the text, syntactic dependency analysis is used to prune irrelevant information from the text: the extracted entities are taken as semantic core words, the text is pruned along its syntactic dependencies, the text within a limited number of dependency steps of the entities is retained, and the unnecessary text noise caused by sparse entity distribution is removed. This effectively mitigates the problem of sparse entity distribution and improves the accuracy of entity relation analysis. The entity pairs are then fed into the model in an effective form, and features are extracted to predict the relationship between the entities. Meanwhile, to prevent overfitting, the model is not allowed to see the entities in the sentence: the entity pair to be predicted is masked in the sentence with [MASK]. Compared with existing relation extraction approaches, the method yields a clear improvement.
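Keeping only the text within a limited number of dependency steps of the entities amounts to a bounded breadth-first search over the dependency tree. The sketch below assumes that a dependency parser has already produced, for each token, the index of its head (with -1 for the root); the parser itself, the function name `prune_by_dependency` and the `max_hops` parameter are assumptions of this example.

```python
from collections import deque
from typing import Dict, List


def prune_by_dependency(
    tokens: List[str], heads: List[int], entity_indices: List[int], max_hops: int
) -> List[str]:
    """Keep tokens within max_hops dependency edges of any entity token."""
    # Treat the dependency tree as an undirected graph.
    neighbors: List[List[int]] = [[] for _ in tokens]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].append(head)
            neighbors[head].append(child)
    # Multi-source breadth-first search starting from the entity tokens.
    distance: Dict[int, int] = {i: 0 for i in entity_indices}
    queue = deque(entity_indices)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    # Preserve the original word order of the retained tokens.
    return [tokens[i] for i in sorted(distance)]
```

A smaller `max_hops` shortens the input more aggressively, at the risk of dropping context words that signal the relation.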

In summary, for information extraction from unstructured text, the invention provides a semi-supervised way of constructing semantic relation data sets and a semantic relation extraction method, both based on deep learning natural language processing. First, a BiLSTM-CRF model extracts the product, amount, time, place, organization and other entities contained in the text; an R-BERT model then predicts the relationships between the entities from the text and the extracted entities; relation links are established between related entities; and the semantic relation triples are extracted, completing the semantic analysis of the text. Addressing the problem of semantic relation analysis and extraction for unstructured documents, the invention discloses a semi-supervised process for labeling text data entities and relations and an automatic information extraction method based on deep learning natural language processing. The invention is a natural language analysis process, arrived at through practical exploration and research, that combines natural language processing with deep learning pre-trained models and achieves a better prediction effect; the algorithm is efficient and well targeted, and the process closely fits data analysis services. In engineering practice, compared with the rule-based extraction processes and general techniques commonly adopted in text data mining projects, the invention offers clearly greater originality, creativity and benefit. Moreover, the invention has the following three advantages:

(1) Training data built directly for emerging industries (such as smart home, the Internet of Things and 5G) suffers from the small number of available articles. To address this, a training data set labeling method is proposed that combines remote supervision with incremental learning on the basis of data migration and enhancement. The market analysis data set of traditional industries is enhanced by constructing an emerging-industry entity library. Entity libraries of products, companies and the like in emerging industries are constructed; in the labeled data of traditional industries, entities are randomly replaced with entities of the same type from the libraries; and for general entity types such as time and amount, random numeral generators with different expression styles are constructed, and the time and amount expressions in the original labeled text data are replaced with randomly generated ones. This alleviates the data shortage in emerging industries and automatically expands the labeled data set substantially, while allowing the model to better learn the linguistic knowledge of the context; compared with common methods, the robustness and accuracy of the model are improved.

(2) The invention adopts incremental learning: a model trained on part of the labeled data predicts another batch of unlabeled text data, prediction results whose confidence is higher than the threshold D_threshold are used directly as labels, and data with low confidence is re-labeled manually. The method greatly improves the efficiency of data labeling and reduces labor cost.

(3) To address the sparse distribution of entities in the text, syntactic dependency analysis is used to prune irrelevant information from the text. The extracted entities are taken as semantic core words, the text is pruned along its syntactic dependencies, and the text within a limited number of dependency steps of the entities is retained, so that the analyzed text is shortened as far as possible while the original context structure is preserved and the unnecessary text noise caused by sparse entity distribution is removed. This greatly improves the training speed and prediction accuracy of the model and alleviates the speed disadvantage of pre-trained models when training and inferring on long sentences.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, characterized by comprising the following steps:

2. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1, wherein the semi-supervised labeling of the text data comprises:

3. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1, wherein extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF model comprises:

4. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1 or 3, further comprising:

acquiring the entities extracted from the text data, and constructing an entity library from them;

5. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1, wherein, if the text data is an article, predicting the relationships between the entities in the text data to be predicted through the trained R-BERT model comprises: