CN112417888A - Method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm

Info

Publication number
CN112417888A
Authority
CN
China
Prior art keywords
text data
entities
algorithm
bert
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011337268.6A
Other languages
Chinese (zh)
Inventor
陆佃龙
王增林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Netspectrum Data Technology Co ltd
Original Assignee
Jiangsu Netspectrum Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Netspectrum Data Technology Co ltd
Priority to CN202011337268.6A
Publication of CN112417888A
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm. Text data of an emerging industry is obtained through a web crawler and labeled in a semi-supervised manner; the labeled text data is preprocessed to construct a training data set and a verification data set; a BiLSTM-CRF algorithm model and an R-BERT algorithm model are trained on the training data set and the verification data set; the entities contained in the text data to be predicted are extracted by the trained BiLSTM-CRF algorithm model; the relations between those entities are predicted from the text data to be predicted and the extracted entities by the trained R-BERT algorithm model, and relation connections are established between related entities; and the semantic relation triples of the text data to be predicted are extracted according to the established relation connections, completing the semantic analysis of the text data. For information extraction from unstructured text, the invention thus provides a high-precision semantic relation extraction method in which a BiLSTM-CRF model extracts the required entities from the text and an R-BERT model predicts the relations between the entities from the text and the extracted entities.

Description

Method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm
Technical Field
The invention relates to the field of computer technology, in particular to a semantic relation analysis method based on deep learning and pre-trained models and an efficient semi-supervised method for constructing semantic entity-relation labels, and specifically to a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm.
Background
With the rapid development of the Internet, analyzing and processing massive Internet data has become a vital task in every industry. The massive text data on the Internet, especially unstructured text, contains a large amount of important information but also a large amount of noise and irrelevant information, and the effective entities are sparsely distributed. How to extract the information in unstructured text with high precision has therefore become important in industry analysis.
Moreover, extracting relations from text data manually requires a large amount of human labor to mine the relations between entities in massive data, while traditional machine learning struggles to reach high precision in semantic analysis. The shortage of labeled data and the high cost of labeling also limit the precision of semantic entity-relation extraction.
Market data in unstructured Internet articles is sparsely distributed, and irrelevant information introduces considerable noise when determining how a market data value relates to the product, time and other items it refers to. When entity semantic analysis is performed on articles about emerging industries (such as smart home, the Internet of Things and 5G) to extract key industry data, few articles are directly tied to the emerging industry, so a training data set is difficult to construct and train on. How to use data sets built from reports on traditional industries for data augmentation and transfer, and thereby solve the "cold start" problem of data extraction in emerging industries, becomes the key to semantic entity-relation analysis.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, so as to solve the problems in the prior art.
In order to achieve the above and other related objects, the present invention provides a method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, comprising the following steps:
acquiring text data of emerging industries through a web crawler, and carrying out semi-supervised labeling on the text data;
preprocessing the labeled text data, and constructing a training data set and a verification data set; training a BiLSTM-CRF algorithm model and an R-BERT algorithm model according to the training data set and the verification data set;
extracting entities contained in text data to be predicted through the trained BiLSTM-CRF algorithm model;
predicting the relations between the entities from the text data to be predicted and the extracted entities through the trained R-BERT algorithm model, and establishing relation connections between related entities;
and extracting the semantic relation triples of the text data to be predicted according to the established relation connections, completing the semantic analysis of the text data to be predicted.
Optionally, performing semi-supervised labeling on the text data includes:
training a model on part of the labeled text data in an incremental learning manner, and predicting the remaining unlabeled text data with the trained model;
and directly taking prediction results whose confidence is higher than a preset threshold as labels of the text data, and manually labeling the text data whose confidence is lower than the preset threshold.
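The confidence-based routing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `predict` callable, the default threshold of 0.9 and the decision to send predictions exactly at the threshold to manual review are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple


def route_by_confidence(
    texts: Iterable[str],
    predict: Callable[[str], Tuple[str, float]],  # hypothetical model interface
    threshold: float = 0.9,                        # assumed value, not from the source
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split unlabeled texts into auto-labeled pairs and texts for manual labeling."""
    auto_labeled: List[Tuple[str, str]] = []
    needs_manual: List[str] = []
    for text in texts:
        label, confidence = predict(text)
        if confidence > threshold:
            # High-confidence prediction is accepted directly as the label.
            auto_labeled.append((text, label))
        else:
            # Low-confidence (or borderline) text goes to a human annotator.
            needs_manual.append(text)
    return auto_labeled, needs_manual
```

In an incremental setting, the auto-labeled pairs and the manually labeled texts would be merged into the training set before the next training round.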
Optionally, extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF algorithm model includes:
modeling the text data sequence in the forward and backward directions with a bidirectional LSTM model;
and scoring the whole prediction path, using a conditional random field (CRF) to constrain the relations between label results, and extracting the entities contained in the text data.
Optionally, the method further comprises: acquiring the entities extracted from the text data, and constructing an entity library from them;
based on the entity library, replacing entities of the same type in traditional-industry data by random replacement, and constructing random numeral generators with different expression styles for general entity types;
and randomly generating replacements for the general entities in the original labeled text data through the random numeral generators, thereby expanding the labeled data.
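A minimal sketch of this entity-replacement augmentation is given below. The span format, the entity type names (`COMPANY`, `PRODUCT`, `AMOUNT`) and the amount templates are hypothetical choices for illustration; the source only states that same-type entities are swapped at random and that numeric expressions are regenerated in different styles.

```python
import random
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type); assumed format


def random_amount(rng: random.Random) -> str:
    """Generate an amount expression in one of several (assumed) surface styles."""
    value = rng.randint(1, 999)
    style = rng.choice(["{v} million yuan", "RMB {v} billion", "{v}0,000 yuan"])
    return style.format(v=value)


def augment(
    text: str,
    spans: List[Span],
    entity_library: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[str, List[Span]]:
    """Replace each labeled entity and return the new text with updated spans."""
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor = 0
    for start, end, etype in sorted(spans):  # assumes non-overlapping spans
        pieces.append(text[cursor:start])
        if etype == "AMOUNT":
            replacement = random_amount(rng)
        elif etype in entity_library:
            replacement = rng.choice(entity_library[etype])
        else:
            replacement = text[start:end]  # unknown type: keep the original entity
        new_start = sum(len(p) for p in pieces)
        pieces.append(replacement)
        new_spans.append((new_start, new_start + len(replacement), etype))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans
```

Because the spans are recomputed alongside the text, each augmented sentence remains a valid labeled example without further annotation.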
Optionally, if the text data is an article, predicting the relations between the entities from the text data to be predicted and the extracted entities through the trained R-BERT algorithm model includes:
acquiring an original BERT model, using the [CLS] token of the original BERT model to represent the features of the sentence as a whole, and using [SEP] to separate the multiple sentences of the input article; and
combining the BERT input with the entities extracted upstream, and encoding with the structure {[CLS] article sentence [SEP] subject [object] [SEP]};
concatenating the sentence vector, the subject entity vector and the object entity vector, and predicting the relation type through a fully connected layer and softmax; wherein $H_s = [h_{s_1}, h_{s_1+1}, \ldots, h_{s_2}]$ represents the subject entity vector and $H_o = [h_{o_1}, h_{o_1+1}, \ldots, h_{o_2}]$ represents the object entity vector.
As described above, the method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm provided by the invention has the following beneficial effects:
Text data of emerging industries is acquired through a web crawler and labeled in a semi-supervised manner; the labeled text data is preprocessed to construct a training data set and a verification data set; a BiLSTM-CRF algorithm model and an R-BERT algorithm model are trained on the training data set and the verification data set; the entities contained in the text data to be predicted are extracted by the trained BiLSTM-CRF algorithm model; the relations between those entities are predicted by the trained R-BERT algorithm model, and relation connections are established between related entities; and the semantic relation triples of the text data to be predicted are extracted according to the established relation connections, completing the semantic analysis. Addressing semantic relation analysis and extraction for unstructured documents, the invention discloses a semi-supervised process for labeling entities and relations in text data and an automatic information extraction method based on deep-learning natural language processing. For extracting unstructured information from articles, the invention provides a high-precision semantic relation extraction method in which a BiLSTM-CRF model extracts the required entities from the text and an R-BERT model predicts the relations between the entities from the text and the extracted entities.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a framework according to an embodiment;
FIG. 3 is a schematic modeling flow diagram provided by an embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1 to 3, the present invention provides a method for analyzing sparse semantic relationship by combining the BiLSTM-CRF algorithm and the R-BERT algorithm, which includes the following steps:
S100, acquiring text data of emerging industries through a web crawler, and carrying out semi-supervised labeling on the text data;
S200, preprocessing the labeled text data, and constructing a training data set and a verification data set; training a BiLSTM-CRF algorithm model and an R-BERT algorithm model according to the training data set and the verification data set;
S300, extracting entities contained in the text data to be predicted through the trained BiLSTM-CRF algorithm model;
S400, predicting the relations between the entities from the text data to be predicted and the extracted entities through the trained R-BERT algorithm model, and establishing relation connections between related entities;
S500, extracting the semantic relation triples of the text data to be predicted according to the established relation connections, and completing the semantic analysis of the text data to be predicted.
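The prediction-time part of this flow (entity extraction, pairwise relation prediction, triple output) can be sketched as below. The two callables stand in for the trained BiLSTM-CRF and R-BERT models and are hypothetical interfaces; enumerating every ordered entity pair and using a `"NONE"` label for unrelated pairs are assumptions made for the example.

```python
from itertools import permutations
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def extract_triples(
    sentence: str,
    extract_entities: Callable[[str], List[str]],      # stand-in for BiLSTM-CRF
    predict_relation: Callable[[str, str, str], str],  # stand-in for R-BERT
    no_relation: str = "NONE",                         # assumed "no relation" label
) -> List[Triple]:
    """Run entity extraction, then relation prediction on each ordered entity pair."""
    entities = extract_entities(sentence)
    triples: List[Triple] = []
    for subject, obj in permutations(entities, 2):
        relation = predict_relation(sentence, subject, obj)
        if relation != no_relation:
            # A predicted relation links the two entities into one triple.
            triples.append((subject, relation, obj))
    return triples
```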
Steps S100 to S500 thus cover the whole flow, from crawling and semi-supervised labeling, through training the two models, to extracting entities, predicting their relations and outputting semantic relation triples. Addressing semantic relation analysis and extraction for unstructured documents, the invention discloses a semi-supervised process for labeling entities and relations in text data and an automatic information extraction method based on deep-learning natural language processing. For extracting unstructured information from articles, the invention provides a high-precision semantic relation extraction method in which a BiLSTM-CRF model extracts the required entities from the text and an R-BERT model predicts the relations between the entities from the text and the extracted entities.
According to the above description, in an exemplary embodiment, the semi-supervised labeling of the text data includes: training a model on part of the labeled text data in an incremental learning manner, and predicting the remaining unlabeled text data with the trained model; prediction results whose confidence is higher than a preset threshold are used directly as labels of the text data, and text data whose confidence is lower than the preset threshold is labeled manually. The invention thus provides a semi-supervised process for labeling entities and relations in text data. For entity semantic relation analysis of market text in new fields and emerging industries on the Internet, it proposes a training data set labeling method that combines distant supervision with incremental learning, based on data transfer and augmentation. Specifically, because market analysis articles on emerging industries are scarce and few sentences contain the required data, the method builds an emerging-industry entity library to augment the market analysis data set of traditional industries. Entity libraries of products, companies and the like in emerging industries are constructed; in traditional-industry data, entities of the same type are replaced at random with entries from the entity libraries; for general entity types such as time and amount, random numeral generators with different expression styles are constructed, and the time and amount expressions in the original labeled text data are replaced with randomly generated ones.
This alleviates the data shortage in emerging industries and automatically expands the labeled data set considerably; it also helps the model learn contextual language knowledge, improving robustness and accuracy compared with common methods. The invention also provides a transfer method for semantic relation analysis of text entities that can transfer across data analysis tasks in different industries, make effective use of historical labeled data, and greatly reduce manual labeling on new tasks, which addresses the labeling task in emerging industries where little data is available. Unlike a fully manual labeling process, the method adopts a semi-supervised labeling process combining distant supervision and incremental learning: a small amount of data is labeled manually, the labeled semantic relations are added to a remote knowledge base, and semantic entity relations in subsequent new text that already appear in the knowledge base are labeled directly by distant supervision. The labeled text data is then augmented by entity replacement using the constructed entity library and the random time and amount generators. Meanwhile, in an incremental learning manner, a model trained on part of the labeled data predicts another batch of unlabeled text data; prediction results whose confidence is higher than the threshold $D_{threshold}$ are used directly as labels, and low-confidence data is relabeled manually.
This greatly improves the efficiency of data labeling and reduces labor cost.
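The distant-supervision step can be illustrated with a small sketch. Representing the knowledge base as a list of (subject, relation, object) triples and matching entities by plain substring search are simplifying assumptions for the example; a real system would more likely match against the entity spans produced by the recognition model.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def distant_label(sentence: str, knowledge_base: List[Triple]) -> List[Triple]:
    """Label a sentence with every known relation whose two entities both occur in it."""
    labels: List[Triple] = []
    for subject, relation, obj in knowledge_base:
        # Assumed matching rule: both entity strings appear verbatim in the sentence.
        if subject in sentence and obj in sentence:
            labels.append((subject, relation, obj))
    return labels
```

Relations confirmed by manual labeling would be appended to `knowledge_base`, so that later sentences mentioning the same entity pair are labeled without human effort.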
According to the above description, in an exemplary embodiment, extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF algorithm model includes: modeling the text data sequence in the forward and backward directions with a bidirectional LSTM model; and scoring the whole prediction path, using a conditional random field (CRF) to constrain the relations between label results, so as to extract the entities contained in the text data. In this embodiment, a deep-learning sequence labeling model performs named entity recognition on the labeled articles to identify the entities they contain.
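How a CRF layer scores whole label paths can be illustrated with Viterbi decoding, the standard way to find the highest-scoring path in a linear-chain CRF. The sketch below assumes the per-token emission scores come from the BiLSTM and the transition scores from the CRF; the source does not specify the decoding procedure, so this is an illustration rather than the patented implementation.

```python
from typing import List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Sequence[float]],    # emissions[t][j]: score of tag j at token t
    transitions: Sequence[Sequence[float]],  # transitions[i][j]: score of tag i -> tag j
) -> Tuple[List[int], float]:
    """Return the best tag path and its total score (emissions plus transitions)."""
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j on this token.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])  # follow back-pointers to the first token
    path.reverse()
    return path, score[best_last]
```

With tags ordered as O, B, I, a strongly negative `transitions[0][2]` discourages an I tag directly after O, which is the kind of label constraint the CRF contributes on top of the BiLSTM scores.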
According to the above description, in an exemplary embodiment, if the text data is an article, predicting the relations between the entities through the trained R-BERT algorithm model includes: obtaining an original BERT model, using the [CLS] token to represent the features of the sentence as a whole, and using [SEP] to separate the multiple sentences of the input article; combining the BERT input with the entities extracted upstream and encoding with the structure {[CLS] article sentence [SEP] subject [object] [SEP]}; and concatenating the sentence vector, the subject entity vector and the object entity vector and predicting the relation type through a fully connected layer and softmax, where $H_s = [h_{s_1}, h_{s_1+1}, \ldots, h_{s_2}]$ represents the subject entity vector and $H_o = [h_{o_1}, h_{o_1+1}, \ldots, h_{o_2}]$ represents the object entity vector.
The method trains an entity recognition model and a relation prediction model separately on the labeled data. A BiLSTM-CRF is constructed and trained for entity recognition: a bidirectional LSTM models the sequence forward and backward, and a conditional random field (CRF) constrains the relations between label results. An R-BERT model is constructed as the relation analysis model: taking the entities extracted in the previous step as semantic centers, the text is pruned along dependency relations to remove unnecessary text noise caused by sparse entity distribution, and the model input is processed into the form {[CLS] article sentence [SEP] subject [object] [SEP]} to train the R-BERT model. For data to be predicted, the entity recognition model BiLSTM-CRF first predicts the entities appearing in the text; the extracted entities are then combined into pairs, integrated with the original text in the format of the training data, and fed to the model to predict the relations between the entities.
The invention thus provides a method for analyzing semantic relations between entities using R-BERT. BERT uses a multi-layer self-attention mechanism to produce a bidirectional encoded representation of text, extracting semantic and syntactic information at different levels from low to high. The model is pre-trained on massive text data with a masked language model objective: 15% of the words in the text are masked, and the model is trained to predict the masked words. Transfer learning with BERT can generally provide strong support for downstream tasks.
The original BERT model uses the [CLS] token to represent the features of the sentence as a whole and [SEP] to separate multiple input sentences. Building on this input structure, the R-BERT structure is proposed for semantic relation analysis: the BERT input is combined with the entities extracted upstream and encoded with the structure {[CLS] article sentence [SEP] subject [object] [SEP]}. $H_s = [h_{s_1}, h_{s_1+1}, \ldots, h_{s_2}]$ denotes the vector of the subject entity, and $H_o = [h_{o_1}, h_{o_1+1}, \ldots, h_{o_2}]$ denotes the vector of the object entity. Finally, the sentence vector, the subject entity vector and the object entity vector are concatenated, and the relation type is predicted through a fully connected layer and softmax:
$h' = W\,[\mathrm{concat}(H, H_s, H_o)] + b$;
$p = \mathrm{softmax}(h')$.
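The two formulas can be worked through in a small, dependency-free sketch. Reducing each entity span to a single vector by mean pooling is an assumption borrowed from common practice; the source only states that $H_s$ and $H_o$ collect the hidden states of the subject and object spans. The weight matrix `W`, bias `b` and the toy dimensions are likewise illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a span of token vectors into one vector (assumed pooling choice)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relation_probabilities(
    hidden: List[Vector],        # per-token hidden states; hidden[0] is [CLS]
    subj_span: Tuple[int, int],  # inclusive token indices (s1, s2)
    obj_span: Tuple[int, int],   # inclusive token indices (o1, o2)
    W: List[Vector],             # one weight row per relation type
    b: Vector,
) -> Vector:
    """Compute p = softmax(W [concat(H, H_s, H_o)] + b)."""
    H = hidden[0]
    H_s = mean_pool(hidden[subj_span[0]: subj_span[1] + 1])
    H_o = mean_pool(hidden[obj_span[0]: obj_span[1] + 1])
    features = H + H_s + H_o  # list concatenation of the three vectors
    logits = [sum(w * x for w, x in zip(row, features)) + bias for row, bias in zip(W, b)]
    return softmax(logits)
```

The predicted relation type is the index of the largest entry in the returned probability vector.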
To address the sparse distribution of entities in text, syntactic dependency analysis is used to prune irrelevant information: the extracted entities are taken as semantic head words, pruning is performed along syntactic dependencies, text within a limited number of dependency steps of the entities is retained, and unnecessary text noise caused by sparse entity distribution is removed. This addresses the sparse entity distribution problem and improves the accuracy of entity relation analysis. The entity pairs are added to the model input, and features are extracted to predict the relation between the entities. Meanwhile, to prevent overfitting, the model is not shown the entities inside the sentence: the entity pair to be predicted is replaced in the sentence with [MASK]. According to the description, this yields a clear improvement over existing relation extraction approaches. The invention thereby discloses a method for removing text noise by syntactic dependency analysis.
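The pruning idea can be sketched as a breadth-first search over the dependency tree that keeps only tokens within a fixed number of hops of any entity token. The head-index representation of the parse, the `max_hops` parameter and its default value are assumptions for illustration; the source does not say which parser is used or how large the step limit is.

```python
from collections import deque
from typing import Dict, List, Set


def prune_by_dependency(
    tokens: List[str],
    heads: List[int],            # heads[i]: index of token i's head, -1 for the root
    entity_indices: List[int],   # token positions belonging to extracted entities
    max_hops: int = 1,           # assumed step limit, not given in the source
) -> List[str]:
    """Keep tokens within max_hops dependency edges of any entity token."""
    neighbors: List[Set[int]] = [set() for _ in tokens]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].add(head)  # treat dependency edges as undirected
            neighbors[head].add(child)
    distance: Dict[int, int] = {i: 0 for i in entity_indices}
    queue = deque(entity_indices)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the step limit
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    # Preserve the original word order of the retained tokens.
    return [tok for i, tok in enumerate(tokens) if i in distance]
```

For a sentence such as "Yesterday analysts said smart speakers reached 3 billion yuan", with "smart speakers" and "3 billion yuan" as entities, a one-hop limit keeps the connecting verb "reached" and drops the reporting clause.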
In summary, based on deep-learning natural language processing, the invention provides a semi-supervised way of constructing semantic relation data sets and a semantic relation extraction method for information extraction from unstructured text. First, the product, amount, time, place, organization and other entities contained in the text are extracted by the BiLSTM-CRF model; the R-BERT model then predicts the relations between the entities from the text and the extracted entities; relation connections are established between related entities; and semantic relation triples are extracted, completing the semantic analysis of the text. Addressing semantic relation analysis and extraction for unstructured documents, the invention discloses a semi-supervised process for labeling entities and relations in text data and an automatic information extraction method based on deep-learning natural language processing. The invention is a natural language analysis process, arrived at through practical exploration, that combines natural language processing with deep-learning pre-trained models and achieves a better prediction effect; the algorithm is efficient and targeted, and the process closely fits data analysis services.
In engineering practice, compared with the rule-based extraction processes and general techniques commonly adopted in text data mining projects, the method is described as offering clearly greater originality and benefit. Moreover, the invention has the following three advantages:
(1) Because few articles are directly tied to emerging industries (such as smart home, the Internet of Things and 5G), training data for those industries is hard to construct. To address this, a training data set labeling method is proposed that combines distant supervision with incremental learning, based on data transfer and augmentation. The market analysis data set of traditional industries is augmented by constructing an emerging-industry entity library. Entity libraries of products, companies and the like in emerging industries are constructed; in the labeled data of traditional industries, entities of the same type are replaced at random with entries from the entity libraries; for general entity types such as time and amount, random numeral generators with different expression styles are constructed, and the time and amount expressions in the original labeled text data are replaced with randomly generated ones. This alleviates the data shortage in emerging industries and automatically expands the labeled data set considerably; it also helps the model learn contextual language knowledge, improving robustness and accuracy compared with common methods.
(2) The invention adopts incremental learning: a model trained on part of the labeled data predicts another batch of unlabeled text data, prediction results whose confidence is higher than the threshold $D_{threshold}$ are used directly as labels, and low-confidence data is relabeled manually. This greatly improves the efficiency of data labeling and reduces labor cost.
(3) To address the sparse distribution of entities in text, syntactic dependency analysis is used to prune irrelevant information. The extracted entities are taken as semantic head words, pruning is performed along syntactic dependencies, and text within a limited number of dependency steps of the entities is retained, so that the analyzed text is shortened as far as possible while the original context structure is preserved and unnecessary text noise caused by sparse entity distribution is removed. This improves the training speed and prediction accuracy of the model and mitigates the speed disadvantage of pre-trained models when training and inferring on long sentences.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (5)

1. A method for analyzing sparse semantic relationships by combining a BiLSTM-CRF algorithm and an R-BERT algorithm, characterized by comprising the following steps:
acquiring text data of emerging industries through a web crawler, and performing semi-supervised labeling on the text data;
preprocessing the labeled text data, and constructing a training data set and a validation data set; training a BiLSTM-CRF algorithm model and an R-BERT algorithm model according to the training data set and the validation data set;
extracting the entities contained in text data to be predicted through the trained BiLSTM-CRF algorithm model;
predicting the relations among the entities from the text data to be predicted and the extracted entities through the trained R-BERT algorithm model, and establishing relation connections between related entities;
and extracting semantic relation triples of the text data to be predicted according to the established relation connections, thereby completing the semantic analysis of the text data to be predicted.
2. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1, wherein the semi-supervised labeling of the text data comprises:
training a model on part of the labeled text data in an incremental learning manner, and predicting the remaining unlabeled text data with the trained model;
and directly using prediction results whose confidence is higher than a preset threshold as labels of the text data, and manually labeling the text data whose confidence is lower than the preset threshold.
3. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1, wherein extracting the entities contained in the text data to be predicted through the trained BiLSTM-CRF algorithm model comprises:
modeling the text data sequence in the forward direction and the backward direction by using a bidirectional LSTM model;
and using a conditional random field (CRF) to constrain the relations between label results and to score the entire prediction path, thereby extracting the entities contained in the text data.
4. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1 or 3, further comprising:
acquiring the entities extracted from the text data, and constructing an entity library from the entities;
based on the entity library, randomly replacing entities of the same type in traditional industry data, and constructing random numeral generators covering different forms of expression for general entity types;
and randomly generating and replacing the general entities in the original labeled text data through the random numeral generators, thereby expanding the labeled data.
5. The method for analyzing sparse semantic relationships by combining the BiLSTM-CRF algorithm and the R-BERT algorithm according to claim 1, wherein, if the text data is an article, predicting the relation between the text data to be predicted and the entities through the trained R-BERT algorithm model comprises:
acquiring an original BERT model, using the [CLS] token to represent the features of the whole sentence in the article through the original BERT model, and using [SEP] to separate the sentences in the input article; and
encoding by combining the BERT input with the entities extracted upstream, using the structure {[CLS] article sentence [SEP] subject [object] [SEP]};
concatenating the sentence vector, the subject entity vector, and the object entity vector, and predicting the relation type through a fully connected layer and softmax; wherein H_s = [h_{s1}, h_{s1+1}, ..., h_{s2}] represents the subject entity vector, and H_o = [h_{o1}, h_{o1+1}, ..., h_{o2}] represents the object entity vector.
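The classification step recited in claim 5 (concatenating the sentence vector with the subject and object entity vectors, then applying a fully connected layer and softmax) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the hidden vectors stand in for encoder output, each entity span H_s and H_o is reduced to a single vector by averaging (the claim does not say how the span is reduced; averaging is an assumption here), and the names classify_relation, mean_vector, weights, and bias are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(hidden, subj_span, obj_span, weights, bias):
    """Return relation-type probabilities.

    hidden:    per-token vectors; hidden[0] is assumed to be the [CLS] vector
    subj_span: (s1, s2) inclusive token indices of the subject entity
    obj_span:  (o1, o2) inclusive token indices of the object entity
    weights:   one row per relation type, each of length 3 * hidden size
    bias:      one value per relation type
    """
    s1, s2 = subj_span
    o1, o2 = obj_span
    sentence_vec = hidden[0]
    subj_vec = mean_vector(hidden[s1:s2 + 1])  # reduce H_s (assumed: mean)
    obj_vec = mean_vector(hidden[o1:o2 + 1])   # reduce H_o (assumed: mean)
    features = sentence_vec + subj_vec + obj_vec  # list concatenation
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy values with hidden size 2 and two relation types.
hidden = [[1.0, 0.0], [0.0, 2.0], [0.0, 4.0], [3.0, 0.0], [0.5, 0.5]]
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]
bias = [0.0, 0.0]
probs = classify_relation(hidden, (1, 2), (3, 3), weights, bias)
print(probs)
```

In a trained model the weights and bias would be learned jointly with the encoder; the fixed toy values here only show the data flow from span indices to a probability per relation type.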
CN202011337268.6A 2020-11-26 2020-11-26 Method for analyzing sparse semantic relationship by combining BilSTM-CRF algorithm and R-BERT algorithm Withdrawn CN112417888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011337268.6A CN112417888A (en) 2020-11-26 2020-11-26 Method for analyzing sparse semantic relationship by combining BilSTM-CRF algorithm and R-BERT algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011337268.6A CN112417888A (en) 2020-11-26 2020-11-26 Method for analyzing sparse semantic relationship by combining BilSTM-CRF algorithm and R-BERT algorithm

Publications (1)

Publication Number Publication Date
CN112417888A true CN112417888A (en) 2021-02-26

Family

ID=74843741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011337268.6A Withdrawn CN112417888A (en) 2020-11-26 2020-11-26 Method for analyzing sparse semantic relationship by combining BilSTM-CRF algorithm and R-BERT algorithm

Country Status (1)

Country Link
CN (1) CN112417888A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905746A (en) * 2021-03-08 2021-06-04 国能大渡河流域水电开发有限公司 System archive knowledge mining processing method based on knowledge graph technology
CN112926313A (en) * 2021-03-10 2021-06-08 新华智云科技有限公司 Method and system for extracting slot position information
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN113220871A (en) * 2021-05-31 2021-08-06 北京语言大学 Literature character relation identification method based on deep learning
CN113268740A (en) * 2021-05-27 2021-08-17 四川大学 Input constraint completeness detection method of website system
CN113312486A (en) * 2021-07-27 2021-08-27 中国电子科技集团公司第十五研究所 Signal portrait construction method and device, electronic equipment and storage medium
CN116523555A (en) * 2023-05-12 2023-08-01 珍岛信息技术(上海)股份有限公司 Clue business opportunity insight system based on NLP text processing technology

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905746A (en) * 2021-03-08 2021-06-04 国能大渡河流域水电开发有限公司 System archive knowledge mining processing method based on knowledge graph technology
CN112926313A (en) * 2021-03-10 2021-06-08 新华智云科技有限公司 Method and system for extracting slot position information
CN112926313B (en) * 2021-03-10 2023-08-15 新华智云科技有限公司 Method and system for extracting slot position information
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN113094482B (en) * 2021-03-29 2023-10-17 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN113268740A (en) * 2021-05-27 2021-08-17 四川大学 Input constraint completeness detection method of website system
CN113220871A (en) * 2021-05-31 2021-08-06 北京语言大学 Literature character relation identification method based on deep learning
CN113220871B (en) * 2021-05-31 2023-10-20 山东外国语职业技术大学 Literature character relation recognition method based on deep learning
CN113312486A (en) * 2021-07-27 2021-08-27 中国电子科技集团公司第十五研究所 Signal portrait construction method and device, electronic equipment and storage medium
CN113312486B (en) * 2021-07-27 2021-11-16 中国电子科技集团公司第十五研究所 Signal portrait construction method and device, electronic equipment and storage medium
CN116523555A (en) * 2023-05-12 2023-08-01 珍岛信息技术(上海)股份有限公司 Clue business opportunity insight system based on NLP text processing technology

Similar Documents

Publication Publication Date Title
CN112417888A (en) Method for analyzing sparse semantic relationship by combining BilSTM-CRF algorithm and R-BERT algorithm
CN110489555B (en) Language model pre-training method combined with similar word information
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN112528034B (en) Knowledge distillation-based entity relationship extraction method
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN110688854B (en) Named entity recognition method, device and computer readable storage medium
CN112149421A (en) Software programming field entity identification method based on BERT embedding
CN109492113A (en) Entity and relation combined extraction method for software defect knowledge
CN113408288A (en) Named entity identification method based on BERT and BiGRU-CRF
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN111651974A (en) Implicit discourse relation analysis method and system
CN109977205A (en) A kind of method of computer autonomous learning source code
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
WO2022226716A1 (en) Deep learning-based java program internal annotation generation method and system
CN113204967B (en) Resume named entity identification method and system
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN116383399A (en) Event public opinion risk prediction method and system
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
CN113868432A (en) Automatic knowledge graph construction method and system for iron and steel manufacturing enterprises
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN113221569A (en) Method for extracting text information of damage test
CN113505583A (en) Sentiment reason clause pair extraction method based on semantic decision diagram neural network
CN115935995A (en) Knowledge graph generation-oriented non-genetic-fabric-domain entity relationship extraction method
CN115391570A (en) Method and device for constructing emotion knowledge graph based on aspects
CN117196032A (en) Knowledge graph construction method and device for intelligent decision, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210226