CN112749549B - Chinese entity relation extraction method based on incremental learning and multi-model fusion - Google Patents

Info

Publication number
CN112749549B
CN112749549B
Authority
CN
China
Prior art keywords
model
relation
entity
training
word vector
Prior art date
Legal status
Active
Application number
CN202110091226.7A
Other languages
Chinese (zh)
Other versions
CN112749549A (en)
Inventor
金康荣
胡岩峰
刘洋
时聪
顾爽
刘午凌
付啟明
Current Assignee
Suzhou Research Institute, Institute of Electronics, Chinese Academy of Sciences
Original Assignee
Suzhou Research Institute, Institute of Electronics, Chinese Academy of Sciences
Priority date
Filing date
Publication date
Application filed by Suzhou Research Institute, Institute of Electronics, Chinese Academy of Sciences
Priority to CN202110091226.7A
Publication of CN112749549A
Application granted
Publication of CN112749549B
Legal status: Active

Links

Classifications

    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F16/355 Class or cluster creation or modification
    • G06F40/194 Calculation of difference between files
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a Chinese entity relation extraction method based on incremental learning and multi-model fusion. The method comprises: pre-training a word vector model, an entity recognition model and a dependency syntax analysis model, and initializing relational data clusters; obtaining an incremental learning sample set for expanding the relational data clusters, obtaining the entity set of each sample with the entity recognition model, extracting the subject, predicate and object of each sentence in the sample with the dependency syntax analysis model, converting the predicates into word vectors with the word vector model, projecting the word vectors into the relational data clusters, and continuously expanding the data volume of each relational data cluster through incremental learning to finally obtain the expanded relational data clusters; and obtaining a test sample set for Chinese entity relation extraction, and determining the relationship category of each test sample by combining the pre-trained models and the expanded relational data clusters, thereby completing Chinese entity relation extraction. The application requires no large amount of manual labeling, and has strong expansion capability and a high degree of generalization.

Description

Chinese entity relation extraction method based on incremental learning and multi-model fusion
Technical Field
The application relates to the field of natural language processing, in particular to a Chinese entity relation extraction method based on incremental learning and multi-model fusion.
Background
In the Internet era, large amounts of information appear at every moment, so users face huge and disordered data and are sometimes overwhelmed; they usually have to spend time carefully reading and understanding it to extract valuable information from unstructured content. An automatic extraction method that helps users quickly find information beneficial to them is therefore urgently needed, and information extraction technology emerged against this background.
Information extraction refers to extracting valuable information from large amounts of unstructured text and converting it into structured data for storage, so that users can conveniently analyze and use it further. Relation extraction is a very important technology in the field of information extraction: it can automatically extract entity pairs in texts and the relations between them to form triplets, help users obtain high-value information from massive data, and let them quickly understand the interrelationships among pieces of information; it is of great significance for the construction of knowledge graphs and question-answering systems.
Most relation extraction is based on supervised learning or rule-based methods, which usually require professionals to manually label data; this often costs a great deal of time and labor, and the labeled data usually contain errors that affect subsequent model training. Moreover, the training data sets used in conventional relation extraction methods are generally specific to a particular field, cannot be reused generically, and are difficult to apply in large-scale engineering. In addition, relation extraction models generated in the traditional way are often limited by the original training data; they cannot effectively utilize ever-growing new data and lack the ability to be updated and expanded.
Disclosure of Invention
The application aims to provide a Chinese entity relation extraction method based on incremental learning and multi-model fusion, which aims to solve the problems that the existing relation extraction method needs a large amount of manual labeling, is limited to a specific field, does not have continuous expansibility, has poor generalization capability and is low in prediction accuracy.
The technical solution for realizing the purpose of the application is as follows: a Chinese entity relation extraction method based on incremental learning and multi-model fusion specifically comprises the following steps:
step 1: acquiring an external corpus of a Word2Vec pre-training model, and training by using a neural network algorithm to obtain a Word vector model;
step 2: obtaining an external corpus of an entity recognition pre-training model, and generating an entity recognition model by combining BiLSTM and CRF algorithms;
step 3: obtaining an external corpus of a dependency syntax analysis pre-training model, and generating a dependency syntax analysis model based on a dependency syntax analysis algorithm;
step 4: initializing a plurality of relation data clusters according to predefined basic relation categories among entities and basic relation words under each category;
step 5: obtaining an incremental learning sample set of an expanded relation data cluster, obtaining an entity set of the sample by utilizing an entity recognition model, extracting subjects, predicates and objects of each sentence in the sample by utilizing a dependency syntactic analysis model, converting predicates in the sentences into word vectors by utilizing a word vector model, projecting the word vectors into a plurality of relation data clusters initialized in the step 4, continuously expanding the data quantity of each relation data cluster by utilizing an incremental learning mode, and finally obtaining a plurality of relation data clusters which are expanded;
step 6: and 5, acquiring a test sample set extracted by Chinese entity relation, combining an entity set obtained by utilizing an entity recognition model, extracting subjects, predicates and objects of each sentence in the test sample by utilizing a dependency syntactic analysis model, converting predicates in the sentences into word vectors by utilizing a word vector model, projecting the word vectors into a plurality of relation data clusters expanded in the step 5, determining corresponding relation categories, and completing Chinese entity relation extraction.
Further, in step 1, an external corpus of a Word2Vec pre-training model is obtained, and a word vector model is obtained through training with a neural network algorithm and is recorded as M_w2v. The specific method comprises the following steps:
1.1, the training corpus is the Chinese Wikipedia corpus; a training data set is generated by performing text content extraction, data processing and word segmentation on the corpus;
1.2, based on the training data set, the Skip-gram model (Continuous Skip-gram Model) in the word2vec algorithm is used for training. The model comprises an input layer, a projection layer and an output layer, predicts the semantic information of the context from the current vocabulary, and calculates the vocabulary probability through formula (1):
P(w_{n-c}, w_{n-c+1}, …, w_{n+c-1}, w_{n+c} | w_n)    (1)
wherein w_n represents the nth vocabulary and c is the size of the sliding window. In the training parameters, the word vector dimension is set to 250 and the window size to 5; a word2vec word vector model is finally generated through training and is recorded as M_w2v.
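The windowed context described by formula (1) can be sketched as follows. This is an illustrative sketch, not the patent's code: it only enumerates the (current word, context word) training pairs that a skip-gram model would be trained on; the token list and the function name `skipgram_pairs` are assumptions for the example.

```python
# Sketch of skip-gram pair generation: for each center word w_n, every word
# within a sliding window of size c on either side becomes a context target.
def skipgram_pairs(tokens, c=5):
    """Return (center, context) pairs for all context words within +-c."""
    pairs = []
    for n, center in enumerate(tokens):
        lo, hi = max(0, n - c), min(len(tokens), n + c + 1)
        for j in range(lo, hi):
            if j != n:  # the center word itself is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# With a real corpus, a library such as gensim could then train the model
# with parameters matching the patent: vector_size=250, window=5, sg=1.
pairs = skipgram_pairs(["深度", "学习", "实体", "关系", "抽取"], c=2)
```

In practice these pairs are fed to the three-layer (input/projection/output) network to maximize the probability of each context word given the center word.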
Further, in step 2, an external corpus of the entity recognition pre-training model is obtained, and an entity recognition model is generated by combining the BiLSTM and CRF algorithms and is recorded as M_ee. The specific method comprises the following steps:
2.1, training is performed on the MSRA_NER training data set by combining the BiLSTM algorithm and the CRF algorithm. The BiLSTM algorithm, also called the bidirectional LSTM algorithm, takes as input the output of the word embedding layer, i.e. the word vectors (w_1, w_2, …, w_n) obtained by embedding-layer conversion after text word segmentation, where w_n represents the nth vocabulary. The output of the forward LSTM is denoted h_t→ and the output of the reverse LSTM is denoted h_t←; the output of the final hidden layer is calculated according to formula (2):
h_t = [h_t→ ; h_t←]    (2)
2.2, a CRF layer is arranged after the BiLSTM layer, and the output of the BiLSTM is constrained by learning a tag state transition probability matrix;
2.3, an entity recognition model is finally generated through training and is recorded as M_ee.
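The concatenation in formula (2) can be sketched numerically. This is a shape-level illustration only: the arrays stand in for the forward and reverse LSTM outputs (a real model would compute them with a deep-learning framework), and the sequence length and hidden size are arbitrary.

```python
import numpy as np

# Sketch of formula (2): the final hidden output at each time step
# concatenates the forward and reverse LSTM states, h_t = [h_t->; h_t<-].
T, H = 4, 3                          # sequence length, per-direction hidden size
h_forward = np.random.randn(T, H)    # placeholder forward LSTM outputs
h_backward = np.random.randn(T, H)   # placeholder reverse LSTM outputs
h = np.concatenate([h_forward, h_backward], axis=1)  # (T, 2H) BiLSTM output
```

The (T, 2H) matrix h is what the CRF layer then scores against the tag transition matrix.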
Further, in step 3, an external corpus of the dependency syntax analysis pre-training model is obtained, and a dependency syntax analysis model is generated based on a dependency syntax analysis algorithm, and the specific method comprises the following steps:
the training corpus is a Ha-Gong Chinese dependency corpus, the corpus is trained by using a dependency syntax analysis algorithm, the interdependence relationship among grammar components in sentences is learned, and finally a dependency syntax analysis model is generated and recorded as M dp
Further, in step 4, according to the predefined categories of basic relationships between entities and basic relationship vocabulary under each category, the relationship data cluster is initialized, and the specific method is as follows:
4.1, predefine basic relationship category labels C = (c_1, c_2, …, c_m) between entities, wherein m is the number of relationship categories;
4.2, collect and sort the basic relationship vocabularies under each category, with no fewer than 20 vocabularies per category; the vocabulary numbers are recorded as P = (p_1, …, p_i, …, p_m), wherein p_i represents the vocabulary number of the ith category;
4.3, using the word vector model M_w2v generated in step 1, convert the basic vocabulary under each relationship category into word vectors, recorded as v_j^(i) (the word vector of the jth vocabulary under the ith category); m relational data clusters are finally formed and recorded as CU = (cu_1, …, cu_i, …, cu_m), wherein cu_i ∈ R^(p_i × l) represents the relational data cluster of the ith category, the data amount in the cluster is p_i, and l is the data dimension of the word vector;
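The cluster initialization can be sketched as one (p_i × l) matrix per category. This is a minimal sketch under stated assumptions: `to_vec` is a random stand-in for the pretrained M_w2v lookup, and the category names and seed words are invented for illustration.

```python
import numpy as np

L = 250  # word vector dimension (the patent uses 250)

def to_vec(word, dim=L):
    # Placeholder for M_w2v[word]: deterministic pseudo-random vector per word.
    rng = np.random.default_rng(abs(hash(word)) % (2**32))
    return rng.standard_normal(dim)

# Hypothetical predefined categories and seed relation vocabularies (>= 20
# words each in the patent; shortened here for the sketch).
seed_words = {
    "located_in": ["位于", "坐落", "地处"],
    "member_of": ["加入", "隶属", "属于"],
}

# Each cluster cu_i is a (p_i, L) matrix of seed-word vectors.
clusters = {cat: np.stack([to_vec(w) for w in ws])
            for cat, ws in seed_words.items()}
```

Each matrix row is one seed word vector; incremental learning later appends rows to these matrices.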
further, in step 5, an incremental learning sample set of the expanded relational data clusters is obtained, an entity recognition model is utilized to obtain an entity set of the samples, subjects, predicates and objects of each sentence in the samples are extracted by utilizing a dependency syntax analysis model, predicates in the sentences are converted into word vectors by utilizing a word vector model, the word vectors are projected into the plurality of relational data clusters initialized in step 4, then the data volume of each relational data cluster is continuously expanded by an incremental learning mode, and finally a plurality of expanded relational data clusters are obtained, and the specific method is as follows:
5.1, take the Sohu News Chinese text corpus as the incremental learning sample set for expanding the relational data clusters; the content is stored in TXT format and recorded as Φ = (T_1, T_2, …, T_n), wherein n is the number of samples;
5.2, for each text T_i in the sample set, extract the entities using the entity recognition model M_ee generated in step 2, and perform de-duplication and stop-word filtering to obtain an entity set, recorded as E;
5.3, split the text T_i into sentences;
5.4, for each sentence in the text, extract the subject, predicate and object of the sentence using the dependency syntax analysis model M_dp generated in step 3 to form a triplet, recorded as (S, V, O);
5.5, judge whether the subject S and the object O in the triplet exist in the entity set E; if so, continue; if not, skip;
5.6, convert the predicate V into a word vector v using the M_w2v model generated in step 1 and match it against the m relational data clusters CU; if the relational word vector data already exists, skip; if not, continue;
5.7, calculate the similarity between the word vector v and the ith relational data cluster according to formula (3):
sim_i = (1/p_i) · Σ_{j=1}^{p_i} cos(v, v_j^(i))    (3)
wherein cos(·) represents the cosine similarity function between vectors and v_j^(i) represents the word vector converted from the jth vocabulary under the ith relationship category;
5.8, obtain the relational data cluster category index î corresponding to the maximum similarity according to formula (4):
î = argmax_{i ∈ {1, …, m}} sim_i    (4)
If the maximum similarity sim_î is greater than or equal to the set similarity threshold θ, the word vector v is extended into the relational data cluster cu_î, i.e. cu_î = cu_î ∪ {v}; if the maximum similarity is smaller than the threshold θ, skip;
5.9, continue executing in this incremental learning mode until all texts in the sample set Φ have been processed; store all data and parameters, exit the iteration, and finally obtain the m expanded relational data clusters CU.
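The core of the incremental expansion (steps 5.6–5.9) can be sketched as follows. This is an illustrative sketch, assuming average cosine similarity per cluster as in formula (3) and a similarity threshold θ; the cluster names, toy 2-dimensional vectors, and function names are assumptions, not the patent's code.

```python
import numpy as np

def cosine(a, b):
    # cos(.) of formula (3): cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand(clusters, v, theta=0.5):
    """Append predicate vector v to its best-matching cluster if the average
    cosine similarity clears theta; return the matched category or None."""
    sims = {c: np.mean([cosine(v, row) for row in M])   # formula (3)
            for c, M in clusters.items()}
    best = max(sims, key=sims.get)                      # formula (4): argmax
    if sims[best] >= theta:
        clusters[best] = np.vstack([clusters[best], v]) # incremental growth
        return best
    return None  # below threshold: skip, per step 5.8

# Toy 2-D clusters for the sketch (real vectors are 250-dimensional).
clusters = {"located_in": np.array([[1.0, 0.0], [0.9, 0.1]]),
            "member_of": np.array([[0.0, 1.0]])}
matched = expand(clusters, np.array([0.95, 0.05]), theta=0.5)
```

Because accepted vectors join the cluster they matched, later predicates are compared against an ever-richer vocabulary, which is what lets the clusters generalize beyond the seed words.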
Further, in step 6, a test sample set for Chinese entity relation extraction is obtained; in combination with the entity set obtained from the entity recognition model, the subject, predicate and object of each sentence in the test sample are extracted with the dependency syntax analysis model, the predicates in the sentences are converted into word vectors with the word vector model, and the word vectors are projected into the plurality of relational data clusters expanded in step 5 to determine the corresponding relationship categories, thereby completing Chinese entity relation extraction. The specific method is as follows:
6.1, obtain the test sample set for Chinese entity relation extraction, recorded as Ψ = (T_1, T_2, …, T_q), wherein q is the number of test samples;
6.2, for each text T_i in the test sample set, extract the entities using the entity recognition model M_ee generated in step 2, and perform de-duplication and stop-word filtering to obtain an entity set, recorded as E;
6.3, split the text T_i into sentences;
6.4, for each sentence in the text, extract the subject, predicate and object of the sentence using the dependency syntax analysis model M_dp generated in step 3 to form a triplet, recorded as (S, V, O);
6.5, judge whether the subject S and the object O in the triplet exist in the entity set E; if so, continue; if not, skip the triplet;
6.6, convert the predicate V into a word vector v using the word vector model M_w2v generated in step 1, project it into the m relational data clusters CU obtained in step 5, and calculate the relational cluster category index î corresponding to the maximum similarity according to formula (3) and formula (4); then the relational cluster category c_î is taken as the relationship between entity S and entity O, and the relationship triplet (S, c_î, O) existing in the sentence is returned;
6.7, continue extracting relations from the test data in this manner until all texts have been processed, and return all extraction results.
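The prediction step can be sketched in the same terms. This is a minimal illustrative sketch: an already-extracted (S, V, O) triplet is labeled with the category of the most similar expanded cluster per formulas (3) and (4); the cluster contents, entity strings, and function names are assumptions for the example.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_relation(clusters, s, v, o):
    """Return (S, category, O), labeling the pair with the category whose
    cluster has the highest average cosine similarity to predicate vector v."""
    sims = {c: np.mean([cosine(v, row) for row in M])  # formula (3)
            for c, M in clusters.items()}
    best = max(sims, key=sims.get)                     # formula (4)
    return (s, best, o)

# Toy 2-D expanded clusters for the sketch (real vectors are 250-dimensional).
clusters = {"located_in": np.array([[1.0, 0.0]]),
            "member_of": np.array([[0.0, 1.0]])}
triple = predict_relation(clusters, "苏州研究院", np.array([0.9, 0.1]), "苏州")
```

Unlike the expansion phase, prediction always returns the argmax category; the patent applies the threshold θ only when growing the clusters.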
A Chinese entity relation extraction system based on incremental learning and multi-model fusion performs Chinese entity relation extraction according to any one of the above methods.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements any one of the above methods for Chinese entity relation extraction based on incremental learning and multi-model fusion.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the above methods for Chinese entity relation extraction based on incremental learning and multi-model fusion.
Compared with the prior art, the application has the following remarkable advantages: (1) the application learns a large amount of semantic information from an external Chinese corpus to generate a word vector model, enabling high-quality semantic understanding of the vocabulary in relation extraction; this enhances the accuracy and generalization capability of relation extraction and solves the problem that traditional relation extraction methods learn only the semantic information of the training set, leading to insufficient semantic understanding and generalization capability.
(2) The application extracts the entity pairs and relations of the text through multi-model fusion, replacing the traditional approach based on a specific training set. By integrating the advantages of multiple models, the accuracy of relation extraction is remarkably improved; the relation training set does not need to be labeled manually, greatly reducing time and labor costs and avoiding the low extraction accuracy caused by manual labeling errors.
(3) The application adopts an incremental learning algorithm to continuously expand and optimize the relational data clusters, remedying the problems of the traditional approach, whose dependence on a specific training data set limits relation extraction to a particular field and leads to insufficient generalization capability and low accuracy in other fields.
Drawings
FIG. 1 is a schematic flow chart of the method of the present application;
FIG. 2 is a schematic diagram of an initialization relational data cluster of the present application;
FIG. 3 is a schematic diagram of an expanded relational data cluster using incremental learning in accordance with the present application;
FIG. 4 is a schematic diagram of the present application for prediction in combination with multiple models and relational data clusters.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As shown in FIG. 1, the present application is mainly divided into six steps:
step I, pre-training a word vector model based on external corpus;
step II, pre-training an entity recognition model based on external corpus;
step III, pre-training a dependency syntactic analysis model based on an external corpus;
step IV, initializing a relational data cluster;
step V, using incremental learning to expand the relational data cluster;
step VI, combining multiple models and relational data clusters to make predictions.
The technical scheme of the application and the scientific principle according to the technical scheme are described in detail below.
The word vector model pre-training specific process in the step I is as follows:
1.1, the training corpus is the Chinese Wikipedia corpus; a training data set is generated by performing text content extraction, data processing, word segmentation and similar operations on the corpus.
1.2, based on the training data set, the Skip-gram model (Continuous Skip-gram Model) in the word2vec algorithm is used for training. The model comprises an input layer, a projection layer and an output layer, predicts the semantic information of the context from the current vocabulary, and calculates the vocabulary probability through formula (1):
P(w_{n-c}, w_{n-c+1}, …, w_{n+c-1}, w_{n+c} | w_n)    (1)
wherein w_n represents the nth vocabulary and c is the size of the sliding window.
1.3, in the training parameters, the word vector dimension is set to 250 and the window size to 5; a word2vec word vector model is finally generated through training and recorded as M_w2v.
The specific process of pre-training the entity recognition model in the step II is as follows:
2.1, training is performed on the MSRA_NER training data set by combining the BiLSTM algorithm and the CRF algorithm. The BiLSTM algorithm, also called the bidirectional LSTM algorithm, can fully capture the semantic information of the context; its input is the output of the word embedding layer, i.e. the word vectors (w_1, w_2, …, w_n) obtained by embedding-layer conversion after text word segmentation, where w_n represents the nth vocabulary. The output of the forward LSTM is denoted h_t→ and the output of the reverse LSTM is denoted h_t←; the output of the final hidden layer is calculated according to formula (2):
h_t = [h_t→ ; h_t←]    (2)
2.2, in order to make full use of the tag state transition information, a CRF layer, i.e. a conditional random field, is added after the BiLSTM layer; the output of the BiLSTM is constrained by learning the tag state transition probability matrix, further improving the rationality of the predicted tags.
2.3, an entity recognition model is finally generated through training and recorded as M_ee.
The pre-training specific process of the dependency syntactic analysis model in the step III is as follows:
3.1, the training corpus is the Harbin Institute of Technology (HIT) Chinese dependency treebank; the corpus is trained with a dependency syntax analysis algorithm to learn the interdependence relationships among grammatical components in sentences, and a dependency syntax analysis model is finally generated and recorded as M_dp.
The specific process of initializing the relational data cluster in step IV is as follows (as shown in fig. 2):
4.1, assuming m is the number of relationship categories, first predefine basic relationship category labels C = (c_1, c_2, …, c_m) between entities, covering the relationship types that can occur between entities in most application scenarios.
4.2, collect and sort the basic relationship vocabularies under each category, with no fewer than 20 vocabularies per category; the vocabulary numbers are recorded as P = (p_1, …, p_i, …, p_m), wherein p_i represents the vocabulary number of the ith category.
4.3, using the M_w2v word vector model generated in step I, convert the basic vocabulary under each relationship category into word vectors (250 dimensions), recorded as v_j^(i); m relational data clusters are finally formed and recorded as CU = (cu_1, …, cu_i, …, cu_m), wherein cu_i represents the relational data cluster of the ith category, the data amount in the cluster is p_i, and each data item is 250-dimensional.
The specific process of using incremental learning to augment the relational data clusters in step V is as follows (as shown in fig. 3):
5.1, the Sohu News Chinese text corpus is used as the incremental learning sample set for expanding the relational data clusters; the content is stored in TXT format and recorded as Φ = (T_1, T_2, …, T_n), where n is the number of samples.
5.2, for each text T_i in the sample set, extract the entities using the entity recognition model M_ee generated in step II, where the entity types include: person, institution, country and place. Then perform de-duplication, stop-word filtering and similar operations on the extracted entities to obtain an entity set, recorded as E.
5.3, split the text T_i into sentences.
5.4, for each sentence in the text, extract the subject, predicate and object of the sentence using the dependency syntax analysis model M_dp generated in step III to form a triplet, recorded as (S, V, O).
5.5, judge whether the subject S and the object O in the triplet exist in the entity set E. If so, continue; if not, skip.
5.6, convert the predicate V into a word vector v using the M_w2v model generated in step I and match it against the m relational data clusters CU; if the relational word vector data already exists, skip; if not, continue.
5.7, calculate the similarity between the word vector v and the ith relational data cluster according to formula (3):
sim_i = (1/p_i) · Σ_{j=1}^{p_i} cos(v, v_j^(i))    (3)
where cos(·) represents the cosine similarity function between vectors.
5.8, obtain the relational data cluster category index î corresponding to the maximum similarity according to formula (4):
î = argmax_{i ∈ {1, …, m}} sim_i    (4)
If the maximum similarity sim_î is greater than or equal to the set similarity threshold θ, the word vector v is extended into the relational data cluster cu_î, i.e. cu_î = cu_î ∪ {v}; if the maximum similarity is smaller than the threshold θ, skip.
5.9, continue executing in this incremental learning mode until all texts in the sample set Φ have been processed; store all data and parameters, exit the iteration, and finally obtain the m expanded relational data clusters CU.
The specific process of predicting in step VI in combination with multiple models and relational clusters is as follows (as shown in fig. 4):
6.1, assume the test sample set for Chinese entity relation extraction is Ψ = (T_1, T_2, …, T_q), where q is the number of test samples.
6.2, for each text T_i, use the same procedure as steps 5.2–5.4 of step V to obtain the entity set E and the subject-predicate-object triplet of each sentence, recorded as (S, V, O).
6.3, judge whether the subject S and the object O in the triplet exist in the entity set E. If so, continue; if not, skip.
6.4, convert the predicate V into a word vector v using the M_w2v word vector model generated in step I, project it into the relational data clusters CU expanded in step V, and calculate the relational cluster category index î corresponding to the maximum similarity according to formula (3) and formula (4) of step V; then the relational cluster category c_î is taken as the relationship between entity S and entity O, and the relationship triplet (S, c_î, O) is returned.
6.5, continue extracting relations from the test data in this manner until all texts have been processed, and return all extraction results.
The application also provides a Chinese entity relation extraction system based on incremental learning and multi-model fusion, which extracts Chinese entity relations according to the above method.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method for Chinese entity relation extraction based on incremental learning and multi-model fusion when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above method for Chinese entity relation extraction based on incremental learning and multi-model fusion.
Examples
In order to verify the effectiveness of the scheme of the present application, the following simulation experiments were performed.
Input: the external corpora of the three pre-training models (word vector model, entity recognition model, dependency syntax analysis model); the incremental learning sample set Φ = (T_1, T_2, …, T_n) for expanding the relational data clusters, where n is the number of samples; the test sample set Ψ = (T_1, T_2, …, T_q) for Chinese entity relation extraction, where q is the number of test samples; the predefined basic relation categories between entities C = (c_1, c_2, …, c_m), where m is the number of relation categories; and the basic relation vocabulary under each category.
Step 1: based on the three external corpora for the word vector model, the entity recognition model and the dependency syntax analysis model, generate the three pre-training models by the methods in steps I-III, namely the word vector model M_w2v, the entity recognition model M_ee, and the dependency syntax analysis model M_dp.
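The word2vec training in Step 1 learns to predict each word's context within a sliding window (the Skip-gram objective). A minimal, illustrative sketch of generating the (center, context) training pairs follows; the real model additionally trains an input-layer/projection-layer/output-layer network over these pairs, and the toy tokens are purely for illustration.

```python
def skipgram_pairs(tokens, c=2):
    """Generate (center, context) pairs for Skip-gram training:
    for word w_n, the contexts are w_{n-c} .. w_{n+c}, excluding w_n."""
    pairs = []
    for n, center in enumerate(tokens):
        for m in range(max(0, n - c), min(len(tokens), n + c + 1)):
            if m != n:
                pairs.append((center, tokens[m]))
    return pairs

# Window size 1 over a four-token segmented sentence
pairs = skipgram_pairs(["中国", "科学院", "苏州", "研究院"], c=1)
# → 6 pairs, e.g. ("中国", "科学院"), ("科学院", "中国"), ...
```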
Step 2: based on the basic relation categories between entities C = (c_1, c_2, …, c_m), initialize the relational data clusters according to the following steps:
step 2.1: obtaining m relational data clusters Cu= (CU) by using the word vector conversion method of 4.2 in the step IV 1 ,…,cu i ,…,cu m ) Wherein, the method comprises the steps of, wherein,relational data cluster representing the ith category, the data amount in the cluster being p i Each data is 250 dimensions.
Step 3: incremental learning sample set phi= [ T ] based on extended data cluster 1 ,T 2 ,…,T n ]Expanding the data cluster according to the following steps:
step 3.1: obtaining each text T by using the recognition mode of 5.2 in the step V i Entity set E in (a).
Step 3.2: for text T i And carrying out clauses.
Step 3.3: using the extraction method of 5.4 in step V, obtain the subject, predicate and object of each sentence, recorded as an (S, V, O) triple.
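Step 3.3 can be sketched as selecting the root verb and its subject and object from the parser's dependency arcs. Treating the parser output as a list of (head, label, dependent) triples is a simplification, and the HED/SBV/VOB label names follow the convention of common Chinese dependency parsers such as HIT's LTP; both are assumptions here.

```python
def extract_svo(arcs):
    """Given dependency arcs as (head, label, dependent) triples, pick the
    root predicate (HED arc) and its subject (SBV) and object (VOB) to
    form an (S, V, O) triple; missing components come back as None."""
    root = next(dep for head, label, dep in arcs if label == "HED")
    subj = next((dep for head, label, dep in arcs
                 if head == root and label == "SBV"), None)
    obj = next((dep for head, label, dep in arcs
                if head == root and label == "VOB"), None)
    return (subj, root, obj)

arcs = [("ROOT", "HED", "出生于"),
        ("出生于", "SBV", "张三"),
        ("出生于", "VOB", "苏州")]
svo = extract_svo(arcs)   # → ("张三", "出生于", "苏州")
```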
Step 3.4: it is determined whether both subject S and object O in the triplet are present in the resulting set of entities E in Step 3.1. If so, entering Step 3.5; if not, ignore and proceed to Step 3.3 for continued analysis.
Step 3.5: using M generated in Step 1 w2v The predicate V is converted into a word vector V by the word vector model, and the word vector V is matched with m relational data clusters CU, if the relational word vector data exists, the predicate V is ignoredSlightly and continuously analyzing in Step 3.3; if not, go to Step 3.6.
Step 3.6: and (3) obtaining the similarity between the predicate vector V and the ith relation cluster by using the calculation mode of 5.7 in the step V.
Step 3.7: obtaining the relation data cluster category index corresponding to the maximum similarity by using the calculation mode of 5.8 in the step VIf its maximum similarity->Greater than or equal to the set threshold->The predicate word vector v is extended to the relational data cluster +.>In (i.e.)>If its maximum similarity is smaller than the threshold +.>Then ignore and go to Step 3.3 for continued analysis.
Step 3.8: and continuously executing according to the incremental learning modes Step 3.1-Step 3.7 until all texts in the sample set phi are learned, storing all data and parameters, and exiting iteration to obtain m expanded relational data clusters CU. Otherwise, go on to Step 3.1.
Step 4: based on the test sample set ψ= (T 1 ,T 2 ,…,T q ) The model M is extracted using the relationship according to the following steps RE And (3) predicting:
step 4.1: for each text T i The same procedure as in Steps 3.1 to 3.3 is used to obtain the real set E and the main predicate (S, V, O) triples of each sentence.
Step 4.2: it is determined whether subject S and object O in the triplet exist in the entity set E generated in Step 4.1. If so, entering Step 4.3; if not, the method is ignored.
Step 4.3: using M generated in Step 1 w2v The predicate is converted into a word vector V by the word vector model, and the word vector V is projected into a relational data cluster obtained by Step 3.8, and the category index of the relational cluster corresponding to the maximum similarity is calculated according to the modes of the formulas (3) and (4) in the Step VThen the relation cluster category->As a relation between entity S and entity O, a relation triplet is returned in the form +.>
Step 4.4: and continuously extracting the relation of the test data according to the modes Step 4.1-Step 4.3 until all the test texts are extracted, exiting, and returning all extraction results.
Output: all relation extraction results of the test sample set.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A Chinese entity relation extraction method based on incremental learning and multi-model fusion is characterized by comprising the following steps:
step 1: acquiring an external corpus of a Word2Vec pre-training model, and training by using a neural network algorithm to obtain a Word vector model;
step 2: obtaining an external corpus of an entity recognition pre-training model, and generating an entity recognition model by combining BiLSTM and CRF algorithms;
step 3: obtaining an external corpus of a dependency syntax analysis pre-training model, and generating a dependency syntax analysis model based on a dependency syntax analysis algorithm;
step 4: initializing a plurality of relation data clusters according to predefined basic relation categories among entities and basic relation words under each category;
step 5: obtaining an incremental learning sample set of an expanded relation data cluster, obtaining an entity set of the sample by utilizing an entity recognition model, extracting subjects, predicates and objects of each sentence in the sample by utilizing a dependency syntactic analysis model, converting predicates in the sentences into word vectors by utilizing a word vector model, projecting the word vectors into a plurality of relation data clusters initialized in the step 4, continuously expanding the data quantity of each relation data cluster by utilizing an incremental learning mode, and finally obtaining a plurality of relation data clusters which are expanded;
step 6: and 5, acquiring a test sample set extracted by Chinese entity relation, combining an entity set obtained by utilizing an entity recognition model, extracting subjects, predicates and objects of each sentence in the test sample by utilizing a dependency syntactic analysis model, converting predicates in the sentences into word vectors by utilizing a word vector model, projecting the word vectors into a plurality of relation data clusters expanded in the step 5, determining corresponding relation categories, and completing Chinese entity relation extraction.
2. The method for extracting Chinese entity relations based on incremental learning and multi-model fusion as defined in claim 1, wherein in step 1, an external corpus for the Word2Vec pre-training model is obtained and a neural network algorithm is used for training to obtain a word vector model, denoted M_w2v; the specific method comprises the following steps:
1.1, the training corpus is a Chinese Wikipedia corpus, and a training data set is generated by performing text content extraction, data processing and word segmentation on the corpus;
1.2, based on the training data set, a Skip-gram model (Continuous Skip-gram Model) in the word2vec algorithm is used for training; the model comprises an input layer, a projection layer and an output layer, predicts the semantic information of the context from the current vocabulary, and calculates the vocabulary probability through formula (1):
P(w_{n-c}, w_{n-c+1}, …, w_{n+c-1}, w_{n+c} | w_n) (1)
where w_n represents the n-th vocabulary item and c is the size of the sliding window; in the training parameters, the word vector dimension is set to 250 and the window size to 5; a word2vec word vector model is finally generated through training, denoted M_w2v;
3. The method for extracting Chinese entity relations based on incremental learning and multi-model fusion according to claim 1, wherein in step 2, an external corpus for the entity recognition pre-training model is obtained, and an entity recognition model, denoted M_ee, is generated by combining the BiLSTM and CRF algorithms; the specific method comprises the following steps:
2.1, training based on the MSRA_NER training data set by combining the BiLSTM algorithm and the CRF algorithm, wherein the BiLSTM algorithm, also called the bidirectional LSTM algorithm, takes as input the output of the word embedding layer, namely the word vectors obtained by the embedding layer after text word segmentation, denoted (w_1, w_2, …, w_n), where w_n represents the n-th vocabulary item; the output of the forward LSTM at time t is denoted h_t^f and the output of the reverse LSTM is denoted h_t^b, and the output of the final hidden layer is their concatenation, calculated according to formula (2):

h_t = [h_t^f ; h_t^b] (2)
2.2, the CRF layer is arranged behind the BiLSTM layer, and the output of the BiLSTM is constrained by learning a tag state transition probability matrix;
2.3, finally generating an entity recognition model through training, denoted M_ee.
4. The method for extracting Chinese entity relations based on incremental learning and multi-model fusion as defined in claim 1, wherein in step 3, an external corpus of the dependency syntax analysis pre-training model is obtained, and a dependency syntax analysis model is generated based on a dependency syntax analysis algorithm; the method comprises the steps of:
the training corpus is the Harbin Institute of Technology (HIT) Chinese dependency treebank; the corpus is trained with a dependency syntax analysis algorithm to learn the interdependency relations among the grammatical components of sentences, finally generating a dependency syntax analysis model, denoted M_dp.
5. The method for extracting Chinese entity relations based on incremental learning and multi-model fusion according to claim 1, wherein in step 4, a plurality of relational data clusters are initialized according to the predefined basic relation categories between entities and the basic relation vocabulary under each category; the specific method comprises:
4.1, predefine the basic relation category labels between entities C = (c_1, c_2, …, c_m), where m is the number of relation categories;
4.2, collect and sort the basic relation vocabulary under each category, with no fewer than 20 words per category; the vocabulary count of each category is recorded as P = (p_1, …, p_i, …, p_m), where p_i represents the vocabulary count of the i-th category;
4.3, using the word vector model M_w2v generated in step 1, convert the basic vocabulary under each relation category into word vectors, denoted v_{i,j}; finally m relational data clusters are formed, denoted CU = (cu_1, …, cu_i, …, cu_m), where cu_i represents the relational data cluster of the i-th category, the amount of data in the cluster is p_i, and L is the data dimension of the word vectors.
6. The method for extracting Chinese entity relations based on incremental learning and multi-model fusion according to claim 1, wherein in step 5, an incremental learning sample set for expanding the relational data clusters is obtained, an entity set of the samples is obtained by using the entity recognition model, the subject, predicate and object of each sentence in the samples are extracted by using the dependency syntax analysis model, the predicates in the sentences are converted into word vectors by using the word vector model, the word vectors are projected into the plurality of relational data clusters initialized in step 4, the data amount of each relational data cluster is then continuously expanded by incremental learning, and a plurality of fully expanded relational data clusters are finally obtained; the method comprises the following specific steps:
5.1, a Sohu News Chinese text corpus is taken as the incremental learning sample set for expanding the relational data clusters; the content is stored in TXT format and denoted Φ = (T_1, T_2, …, T_n), where n is the number of samples;
5.2, for each text T_i in the sample set, extract the entities using the entity recognition model M_ee generated in step 2, perform de-duplication and stop-word filtering, and obtain the entity set, denoted E;
5.3, split the text T_i into sentences;
5.4, for each sentence in the text, extract the subject, predicate and object using the dependency syntax analysis model M_dp generated in step 3 to form a triple, denoted (S, V, O);
5.5, judging whether the subject S and the object O in the triplet exist in the entity set E or not, and if so, continuing; if not, skipping;
5.6, using the model M_w2v generated in step 1, convert the predicate V into a word vector v and match it against the m relational data clusters CU; if the word vector already exists in the relational data, skip it; if not, continue;
5.7, calculate the similarity sim_i between the word vector v and the i-th relation cluster according to formula (3):

sim_i = (1/p_i) Σ_{j=1}^{p_i} cos(v, v_{i,j}) (3)

where cos(·) represents the cosine similarity function between vectors, and v_{i,j} represents the word vector converted from the j-th word under the i-th relation category;
5.8, obtain the relation data cluster category index k* corresponding to the maximum similarity according to formula (4):

k* = argmax_{i∈{1,…,m}} sim_i (4)
if the maximum similarity sim_{k*} is greater than or equal to the set similarity threshold θ, the word vector v is extended into the relational data cluster cu_{k*}, i.e. cu_{k*} = cu_{k*} ∪ {v}; if the maximum similarity is smaller than the threshold θ, it is skipped;
and 5.9, continue executing in this incremental learning manner until all texts in the sample set Φ have been processed; save all data and parameters, exit the iteration, and finally obtain m fully expanded relational data clusters CU.
7. The method for extracting Chinese entity relations based on incremental learning and multi-model fusion according to claim 1, wherein in step 6, a test sample set for Chinese entity relation extraction is obtained, an entity set of the test samples is obtained by the entity recognition model, the subject, predicate and object of each sentence in the test samples are extracted by using the dependency syntax analysis model, the predicates in the sentences are converted into word vectors by using the word vector model, the word vectors are projected into the plurality of relational data clusters expanded in step 5, the corresponding relation categories are determined, and Chinese entity relation extraction is completed; the method comprises the following specific steps:
6.1, obtain the test sample set for Chinese entity relation extraction, denoted Ψ = (T_1, T_2, …, T_q), where q is the number of test samples;
6.2, for each text T_i in the test sample set, extract the entities using the entity recognition model M_ee generated in step 2, perform de-duplication and stop-word filtering, and obtain the entity set, denoted E;
6.3, split the text T_i into sentences;
6.4, for each sentence in the text, extract the subject, predicate and object using the dependency syntax analysis model M_dp generated in step 3 to form a triple, denoted (S, V, O);
6.5, judging whether the subject S and the object O in the triplet exist in the entity set E or not, and if so, continuing; if not, skipping the triplet;
6.6, using the word vector model M_w2v generated in step 1, convert the predicate V into a word vector v and project it into the m relational data clusters CU obtained in step 5; calculate the relation cluster category index k* corresponding to the maximum similarity according to formula (3) and formula (4), then take the relation cluster category c_{k*} as the relation between entity S and entity O, and return the relation triple (S, c_{k*}, O) present in the sentence;
and 6.7, continue extracting relations from the test data in this manner until all texts have been processed, and return all extraction results.
8. A Chinese entity relation extraction system based on incremental learning and multi-model fusion, wherein Chinese entity relation extraction based on incremental learning and multi-model fusion is performed based on the method of any one of claims 1-7.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-7 for Chinese entity relation extraction based on incremental learning and multi-model fusion when the computer program is executed.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1-7 for Chinese entity relation extraction based on incremental learning and multi-model fusion.
CN202110091226.7A 2021-01-22 2021-01-22 Chinese entity relation extraction method based on incremental learning and multi-model fusion Active CN112749549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110091226.7A CN112749549B (en) 2021-01-22 2021-01-22 Chinese entity relation extraction method based on incremental learning and multi-model fusion


Publications (2)

Publication Number Publication Date
CN112749549A CN112749549A (en) 2021-05-04
CN112749549B true CN112749549B (en) 2023-10-13

Family

ID=75652977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110091226.7A Active CN112749549B (en) 2021-01-22 2021-01-22 Chinese entity relation extraction method based on incremental learning and multi-model fusion

Country Status (1)

Country Link
CN (1) CN112749549B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360641B (en) * 2021-05-07 2023-05-30 内蒙古电力(集团)有限责任公司乌兰察布电业局 Deep learning-based power grid fault handling plan semantic modeling system and method
CN113705196A (en) * 2021-08-02 2021-11-26 清华大学 Chinese open information extraction method and device based on graph neural network
CN113657116B (en) * 2021-08-05 2023-08-08 天津大学 Social media popularity prediction method and device based on visual semantic relationship

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170089142A (en) * 2016-01-26 2017-08-03 경북대학교 산학협력단 Generating method and system for triple data
CN109241538A (en) * 2018-09-26 2019-01-18 上海德拓信息技术股份有限公司 Based on the interdependent Chinese entity relation extraction method of keyword and verb
CN111223539A (en) * 2019-12-30 2020-06-02 同济大学 Method for extracting relation of Chinese electronic medical record


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research progress on patent intelligence methods, tools and applications, and application trends of new technologies (专利情报方法、工具、应用研究进展及新技术应用趋势); 吕璐成, 罗文馨, 许景龙, 王莉莉, 马丽婧, 赵亚娟; Progress in Informatics (情报学进展), Issue 00; full text *
A survey of scholar profiling techniques on the open Internet (开放互联网中的学者画像技术综述); 袁莎, 唐杰, 顾晓韬; Journal of Computer Research and Development (计算机研究与发展), Issue 09; full text *


Similar Documents

Publication Publication Date Title
CN112749549B (en) Chinese entity relation extraction method based on incremental learning and multi-model fusion
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN109871535B (en) French named entity recognition method based on deep neural network
CN107748757B (en) Question-answering method based on knowledge graph
CN110209836B (en) Remote supervision relation extraction method and device
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN111950287B (en) Entity identification method based on text and related device
CN111797241B (en) Event Argument Extraction Method and Device Based on Reinforcement Learning
TW201432669A (en) Acoustic language model training method and apparatus
CN110888980A (en) Implicit discourse relation identification method based on knowledge-enhanced attention neural network
CN111400455A (en) Relation detection method of question-answering system based on knowledge graph
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN111242033A (en) Video feature learning method based on discriminant analysis of video and character pairs
CN111274829A (en) Sequence labeling method using cross-language information
CN113392265A (en) Multimedia processing method, device and equipment
CN114743143A (en) Video description generation method based on multi-concept knowledge mining and storage medium
CN113065349A (en) Named entity recognition method based on conditional random field
CN111340006A (en) Sign language identification method and system
Hassani et al. LVTIA: A new method for keyphrase extraction from scientific video lectures
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
CN113096687B (en) Audio and video processing method and device, computer equipment and storage medium
CN113377844A (en) Dialogue type data fuzzy retrieval method and device facing large relational database
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism
CN112766368A (en) Data classification method, equipment and readable storage medium
CN117407532A (en) Method for enhancing data by using large model and collaborative training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant