CN115600602B - Method, system and terminal device for extracting key elements of long text - Google Patents

Method, system and terminal device for extracting key elements of long text

Info

Publication number
CN115600602B
CN115600602B (application CN202211592205.4A)
Authority
CN
China
Prior art keywords
model
sample
training
sequence
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211592205.4A
Other languages
Chinese (zh)
Other versions
CN115600602A (en)
Inventor
李芳芳
曾咏哲
胡世雄
罗垲炜
甘甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202211592205.4A
Publication of CN115600602A
Application granted
Publication of CN115600602B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a system and a terminal device for extracting key elements from long texts. The extraction method comprises: dividing a labeled sample set into a training set, a verification set and a test set; performing semantic-based data enhancement on the training set to obtain an enhanced set; updating the parameters of a plurality of key element extraction base models with the enhanced set; expanding the training set by converting unlabeled samples into labeled value sample sets and then training cyclically; and finally determining the optimal key element extraction base models. The key element extraction system comprises a data enhancement module, a model optimization module, a sample expansion module and other modules; the terminal device comprises a memory, a processor and a computer program stored in the memory and executable on the processor. The method addresses the low quality of generated labeled samples and the low element extraction precision of key element extraction methods based on conventional small-sample learning frameworks.

Description

Method, system and terminal device for extracting key elements of long text
Technical Field
The invention relates to the technical field of text information extraction, in particular to a method, a system and a terminal device for extracting key elements of a long text.
Background
Named entity recognition aims at extracting entities with specific meanings, together with their classes, such as person names, place names and organization names, from large amounts of text data. It is one of the important subtasks of natural language processing and, as a key technology, provides underlying support for tasks such as intelligent question answering, knowledge graphs and syntactic analysis.
Named entity recognition technology has developed continuously, from early methods based on statistics and manually defined rules, to methods based on feature engineering and machine learning, and then to the deep learning methods popular in recent years, with markedly improved recognition performance. With the rapid growth of network information resources and the continuous development of intelligent information extraction technology, general-purpose named entity recognition can hardly meet the requirements of domain-specific named entity recognition. The task of extracting key elements from long network texts focuses on named entities related to public opinion services, and the extraction process faces the following problems:
(1) In the prior art, key element extraction algorithms for long network texts adopt deep learning, whose extraction performance depends to a great extent on the scale and quality of the labeled corpus. Because labeled long network text corpora are scarce, sufficient data for training a model is often hard to acquire, so the model cannot capture enough data patterns.
In existing research, small-sample learning is usually used to extract key elements so as to cope with the small amount of training data. However, conventional small-sample learning schemes usually adopt a single strategy, such as data enhancement, transfer learning, active learning or weakly supervised learning, each with its own drawbacks: 1) key element extraction methods based on active learning generally use a single sampling criterion, considering only one index such as uncertainty or diversity, and still require experts to participate in data annotation; 2) key element extraction methods based on simple data enhancement tend to ignore high-order feature information at the syntactic and semantic levels of the text when expanding data, and cannot model the characteristics of the data comprehensively, so the quality of the training samples is low and more noise is introduced into the key element extraction model; 3) the prior art does not fully combine the knowledge of existing models with multiple strategies to exploit the massive unlabeled long network text data, so the cost of training a model remains high.
(2) Unlike named entity recognition in the general domain, the key element extraction task for long network texts mainly focuses on named entities closely related to the public opinion domain, aiming to extract the boundaries and categories of public opinion named entities from unstructured long network texts. However, fine-grained public opinion named entities have many categories and strong domain characteristics; for example, the "event-related entity" in the public opinion domain may be divided into elements with public opinion domain properties, such as "event-related person", "event-related media" and "event-related company", whose classification is closely related to the context. Existing methods for extracting key elements from long network texts have difficulty handling the long-distance dependencies among public opinion named entities and easily misclassify entities.
For the technical problems of scarce high-quality text data and low extraction precision in extracting key elements from long network texts, no effective solution has yet been proposed. Therefore, how to construct a training sample expansion scheme from a small number of labeled samples and a large number of unlabeled samples, and how to construct a key element extraction model for long network texts that yields high-precision extraction results, have become urgent problems.
In summary, a method, a system and a terminal device for extracting key elements of long texts are urgently needed to solve the problems in the prior art.
Disclosure of Invention
The invention aims to provide a method, a system and a terminal device for extracting key elements from long texts, so as to improve the precision of key element extraction.
In order to achieve the purpose, the invention provides a method for extracting key elements of a long text, which comprises the following steps:
the method comprises the following steps: acquiring a long text data set, and dividing the long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; dividing an unlabeled sample set U into a plurality of subsets;
step two: preprocessing a training set Tr to obtain a text sequence and a label sequence of each training sample in the training set Tr;
step three: continuously pre-training the open source model to obtain a plurality of pre-training language models, and performing model migration through the pre-training language models to obtain a plurality of key element extraction base models;
step four: combining a text sequence and a label sequence of each training sample in the training set Tr, and performing semantic-based data enhancement on the training set Tr to obtain an enhanced set En;
step five: updating parameters of the extracted basic models of the key elements through an enhancement set En;
step six: preprocessing the verification set Va, acquiring a text sequence and a label sequence of each verification sample in the verification set Va, inputting the text sequence and the label sequence into each key element extraction base model respectively, and determining an optimal key element extraction base model corresponding to each key element extraction base model;
step seven: judging whether a training stopping criterion is met; if the training stopping criterion is met, executing the step ten, and if the training stopping criterion is not met, executing the step eight;
step eight: inputting a subset in the unlabeled sample set U into each optimal key element extraction base model to obtain a corresponding labeled value sample set;
step nine: adding the value-labeled sample set obtained in the step eight into a training set Tr, and removing the subset obtained in the step eight from an unlabeled sample set U; repeating the fourth step to the ninth step until the training stopping criterion is met;
step ten: and performing key element extraction on the test set Te through the optimal key element extraction base model.
Preferably, the third step includes:
step 3.1: continuously pre-training the source model through a public opinion corpus to obtain a corresponding pre-training language model, wherein the pre-training language model comprises: a DeBERta model, a LEBERT model, a CogBERT model, a Syntax model, and a Senntence-BERT model;
step 3.2: carrying out model migration on part of the pre-training language models, wherein the obtained key element extraction base model comprises the following steps: the DeBERTA-BiONLSTM-MHA-MCRF model, the LEBERT-BiONLSTM-MHA-MCRF model, and the CogBERT-BiONLSTM-MHA-MCRF model.
Preferably, the fourth step includes:
step 4.1: define the th in the training set TriA training sample tr i Is s i And the tag sequence is l i Obtaining a text sequence s by a SyntaxBERT model i Semantic vector of each character in the text sequence s i A tag sequence l i Extracting training samples tr i Form a subset of entity samples Ent i First of a subset of the entity samplesjIndividual entity sample ent j =<StrEnt j ,Type j >Wherein StrEnt j As entity sample ent j Is a string representation of, type j Is an entity sample ent j The entity class of (1);
step 4.2: for StrEnt j The semantic vector of the character is averaged to obtain an entity sample ent j The corresponding entity vector entEmb j
Step 4.3: forming entity samples in entity sample subsets corresponding to all training samples into an entity sample set Ent of the training set, calculating cosine similarity between entity vectors of any two entity samples in the entity sample set Ent, and when the Sim is larger than or equal to sigma and the entity classes of the two entity samples are the same, respectively adding the two entity samples into semantic neighbor sets of each other, wherein the sigma is a preset entity similarity threshold;
step 4.4: traverse each training sample tr i If the entity sample ent j Is not an empty set, then at ent j Semantic neighbor set ofIn which one entity sample pair is selected j Replacing to obtain an enhanced text sequence s i * And its corresponding tag sequence l i * Thus obtaining the sentence pair sample pair to be evaluated i =<s i ,s i * >;
Step 4.5: obtaining a Sentence vector of each Sentence to be evaluated to the sample through a sequence-BERT model, wherein s i The sentence vector of is SenEmb i ,s i * The sentence vector of is SenEmb i * Calculate SenEmb i And SenEmb i * Cosine similarity between the text sequences is SimSem, when the SimSem is more than or equal to beta, the text sequence s is enhanced i * And its corresponding tag sequence l i * As an extended sample, adding all the extended samples into a training set Tr to obtain an enhanced set En; where β is a preset sentence similarity threshold.
Preferably, in the sixth step, the model parameter with the highest F1 value in each key element extraction base model in the whole training process is used as the model parameter of each corresponding optimal key element extraction base model.
Preferably, in the seventh step, the training stopping criterion is that the performance of each optimal key element extraction base model on the verification set Va has reached a preset performance threshold α, or the training set Tr has reached a preset data volume.
Preferably, the step eight includes:
step 8.1: extracting each unlabeled sample u in the base model pair subset US through the optimal key elements m Us of the text sequence m Predicting to obtain a weak label sequence set corresponding to the sample; wherein the content of the first and second substances,m=1,2, … … Z, Z is the total number of unlabeled samples in the subset;
step 8.2: calculating an unlabeled sample u according to a Dropout consistency score function m Uncertainty compared to the global modelDACS(u m );
Step 8.3: respectively obtaining unlabeled samples u through a sequence-BERT model m Sentence vector SenEmb m And the sentence vectors of the marked samples, and clustering the sentence vectors of all the marked samples to obtain D clustering centers X d d=1,2,……D;
Step 8.4: calculating unlabeled sample u m Maximum semantic similarity to all cluster centersSimwt(u m );
Step 8.5: binding unlabeled samples u m Calculating to obtain an unlabeled sample u m Information density ofInfo(u m ) (ii) a If it isInfo(u m ) If the value is more than or equal to theta, in the weak label sequence set corresponding to the sample, for each maximum probability label sequence pair character w k The predictive tag of (2) is used as the character w k Final label, in which the character w k As unlabeled sample u m Us of a text sequence m To (1)kA character of each position;
thereby obtaining a text sequence us m Corresponding tag sequence ul m Will text sequence us m And the tag sequence ul m And forming the marked value samples, wherein all the marked value samples form a marked value sample set VS.
Preferably, in the step 8.2, the uncertainty DACS(u_m) of the unlabeled sample u_m compared to the overall model is calculated by expressions 1) and 2):

DACS(u_m) = (1/N) · Σ_{k=1}^{N} Dis(w_k)   1);

Dis(w_k) = (2/(J·(J−1))) · Σ_{I=1}^{J−1} Σ_{I'=I+1}^{J} 1[ŷ_k^{(I)} ≠ ŷ_k^{(I')}]   2);

where N is the sequence length of the unlabeled sample u_m; J is the number of optimal key element extraction base models; I and I' are the model indices of the two optimal key element extraction base models M participating in the calculation, I = 1, 2, ..., J, I' = 1, 2, ..., J, and I ≠ I'; ŷ_k^{(I)} is the predicted tag of the optimal key element extraction base model M_I* for the character w_k; ŷ_k^{(I')} is the predicted tag of the optimal key element extraction base model M_I'* for the character w_k; and 1[·] is the indicator function;

in the step 8.4, the maximum semantic similarity Simwt(u_m) between the unlabeled sample u_m and all cluster centers is calculated by expression 3):

Simwt(u_m) = max_{d=1,...,D} MinMax(CosineSim(SenEmb_m, SenEmb_d))   3);

where MinMax(·) denotes the Min-Max normalization function, and CosineSim(SenEmb_m, SenEmb_d) denotes the cosine similarity between the sentence vector SenEmb_m and the sentence vector SenEmb_d of the cluster center X_d;

in the step 8.5, the information density Info(u_m) of the unlabeled sample u_m is calculated by expression 4):

Info(u_m) = μ · DACS(u_m) + (1 − μ) · Simwt(u_m)   4);

where μ is a preset regulating factor.
Preferably, the step ten includes:
step 10.1: preprocessing the test set Te and obtaining a text sequence ts of each test sample q
Step 10.2: respectively extracting the text sequence ts of the base model for each test sample through each optimal key element after training q Predicting to obtain a text sequence ts q Corresponding maximum probability label sequences, and forming a candidate label sequence set by the maximum probability label sequences of all test texts;
step 10.3: definition a e For text sequences ts q To (1)eThe characters of each position in the candidate label sequence set are corresponding to the character a of each maximum probability label sequence e The predicted label of (2) is the character a with the predicted label with the largest number of occurrences e The final label, resulting in the text sequence ts q Corresponding final tag sequence tl q And extracting results as key elements.
The invention also provides a system for extracting the key elements of the long text, which is used for realizing the method for extracting the key elements of the long text and comprises a definition module, a preprocessing module, a model construction module, a data enhancement module, a model optimization module, a sample expansion module and a model test module;
a definition module: for obtaining a long text data set;
a preprocessing module: used for preprocessing text data to obtain text sequences and tag sequences;

a model construction module: used for continuously pre-training open source models and performing model migration to obtain the key element extraction base models;

a data enhancement module: used for performing data enhancement on the training set Tr to obtain the enhanced set En;

a model optimization module: used for updating the parameters of the key element extraction base models through the enhanced set;

a sample expansion module: used for generating the labeled value sample set and adding it to the training set Tr;

a model testing module: used for extracting key elements from the test set Te through the optimal key element extraction base models.
The invention also provides a terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the method for extracting the key elements of the long text.
The technical scheme of the invention has the following beneficial effects:
(1) By using a key element extraction method that integrates data enhancement, transfer learning, active learning and weakly supervised learning, the invention can, in low-resource scenarios, screen out value samples with richer information and label them automatically, accurately and efficiently, while generating a large number of high-quality labeled samples without changing the syntactic structure and semantics of the original text as far as possible. This significantly reduces the cost of manual labeling, makes the fullest use of unlabeled samples, and effectively improves the performance of the key element extraction model, solving the problems of low quality of generated labeled samples, low element extraction precision, and the high cost of expert-labeled value samples in key element extraction methods based on conventional small-sample learning frameworks.
(2) The invention continuously pre-trains open source models on a large-scale public opinion corpus and performs parameter-based transfer learning on them: the pre-training language models transfer their weight parameters to the shared parameters of the key element extraction base models, constructing several base models that can fully mine long-distance dependent semantic features in text. With little data available in the long network text domain, existing knowledge initializes model performance to the greatest extent; the models can subsequently be fine-tuned with labeled long network text data from the target domain, improving performance on the long network text key element extraction task while saving computing resources and cost.
(3) The invention performs data enhancement based on semantics. When enhancing long network text data, the text is encoded with the SyntaxBERT and Sentence-BERT pre-training language models, which encode the syntactic and semantic information of the text better than conventional pre-training language models. During enhancement, the original entity is first replaced by a target entity that belongs to the same entity type and is semantically similar to it, yielding an enhanced sentence after entity replacement; enhanced sentences that are semantically similar to the original sentence are then retained together with their tag sequences. This efficiently encodes the syntactic and semantic information of the text and protects entity-level and sentence-level semantic feature information, so the invention minimizes the loss of entity-level and sentence-level semantic information during data enhancement, improves the quality of the enhanced data, significantly reduces the cost of manual labeling, and significantly improves the extraction precision of the key element extraction model.
(4) In the invention, the model parameters with the highest F1 value of each key element extraction base model during the whole training process are taken as the model parameters of each corresponding optimal key element extraction base model, because the F1 value is the harmonic mean of precision and recall and balances both at once, making it a more comprehensive evaluation index.
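For reference, the F1 value is the standard harmonic mean of precision P and recall R:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$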
(5) In the invention, value samples are selected by combining a Dropout consistency index (i.e., uncertainty) with a semantic similarity index (i.e., maximum semantic similarity) into a combined query criterion. The criterion ensures that the selected samples are the most uncertain under the current overall model, while the semantic similarity accounts for the global information content of the samples, mitigating the negative influence of outlier samples on the overall model. Compared with conventional active learning sampling criteria, the combined query criterion characterizes the samples to be screened more comprehensively and helps select samples with richer information. Several optimal key element extraction base models predict the same value sample to construct a weakly supervised learning scenario, and the predicted entities are inferred preferentially through entity-level multi-model voting, exploiting the advantages of different weakly supervised models; higher key element extraction precision is thus obtained on top of several homogeneous base models, and the integrated model has better generalization and robustness than a single base model.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the invention. In the drawings:
fig. 1 is a flowchart of a method for extracting key elements from a long text in embodiment 1 of the present application;
fig. 2 is a schematic structural diagram of a long text key element extraction system according to embodiment 1 of the present application;
the system comprises a definition module 1, a preprocessing module 2, a model construction module 3, a data enhancement module 4, a model optimization module 5, a sample expansion module 6 and a model testing module 7.
Detailed Description
Embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways, which are defined and covered by the claims.
Example 1:
referring to fig. 1 to 2, the present embodiment is applied to key element extraction of a long text on a network, and is convenient for timely and accurately extracting key public opinion information from the long text on the network.
A method for extracting key elements of a long text, referring to FIG. 1 (S1-S10 in the figure represent steps one to ten), comprises the following steps:
the method comprises the following steps: acquiring a network long text data set on a network platform, and dividing the long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; dividing an unlabeled sample set U into a plurality of subsets;
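As a minimal sketch of this split, assuming illustrative 70/15/15 ratios and a fixed subset size, neither of which is specified in the patent:

```python
# Sketch of step one: split the labeled set into Tr/Va/Te and cut the
# unlabeled set into fixed-size subsets. Ratios and subset size are
# illustrative assumptions, not values from the patent.
from sklearn.model_selection import train_test_split

def split_data(labeled: list, unlabeled: list, subset_size: int):
    tr, rest = train_test_split(labeled, test_size=0.3, random_state=42)
    va, te = train_test_split(rest, test_size=0.5, random_state=42)
    subsets = [unlabeled[i:i + subset_size]
               for i in range(0, len(unlabeled), subset_size)]
    return tr, va, te, subsets
```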
step two: preprocessing a training set Tr to obtain a text sequence and a label sequence of each training sample in the training set Tr;
step three: continuously pre-training the open source model to obtain a plurality of pre-training language models, and carrying out model migration through the pre-training language models to obtain a plurality of key element extraction base models M I I=1,2,……JJThe method specifically comprises the following substeps of extracting the number of models of the base model for the key elements:
step 3.1: collecting about 100 pieces of Internet public opinion texts from a 'people data' platform to complete the construction of a public opinion corpus; continuing to pre-train the sourcing model through a public opinion corpus to obtain a corresponding pre-trained language model, wherein the pre-trained language model comprises: a DeBERta model, a LEBERT model, a CogBERT model, a Syntax model, and a Senntence-BERT model;
step 3.2: model migration is carried out on part of the pre-training language model through a parameter-based migration learning method, and a corresponding key element extraction base model M is obtained by combining a Bi-directional Ordered neural Long Short Term Memory network (Bi-directional Ordered nerves Long Short-Term Memory, biONLSTM), a Multi-Head Attention Mechanism (MHA) and a mask added Conditional Random Field (MCRF) I In this embodiment, three key element extraction basis models are obtained: deBERTA-BiONLSTM-MHA-MCRF model (defined as M) 1 ) The LEBERT-BiONLSTM-MHA-MCRF model (defined as M) 2 ) And the CogBERT-BiONLSTM-MHA-MCRF model (defined as M) 3 )。
The open source models are continuously pre-trained on a large-scale public opinion corpus and given parameter-based transfer learning: the pre-training language models transfer their weight parameters to the shared parameters of the key element extraction base models, constructing several base models that can fully mine long-distance dependent semantic features in text. With little data available in the long network text domain, existing knowledge initializes model performance to the greatest extent; the models can subsequently be fine-tuned with labeled long network text data from the target domain, helping to improve performance on the long network text key element extraction task and saving computing resources and cost.
Step four: combining a text sequence and a label sequence of each training sample in a training set Tr, and performing semantic-based data enhancement on the training set Tr to obtain an enhancement set En; the method specifically comprises the following substeps:
step 4.1: define the second in the training set TriA training sample tr i Is s i The sequence of the tag is l i Obtaining a text sequence s by a SyntaxBERT model i Semantic vector of each character in the text sequence s i A tag sequence l i Extracting training samples tr i Form a subset of entity samples Ent i First of a subset of the entity samplesjIndividual entity sample ent j =<StrEnt j ,Type j >Wherein StrEnt j Is an entity sample ent j Type of j Is an entity sample ent j The entity class of (1);
step 4.2: for entity sample subset Ent i Each entity sample ent in j For constituting character string, strEnt is expressed j The semantic vector of the character is averaged to obtain the entity sample ent j The corresponding entity vector entEmb j
Step 4.3: forming entity samples in entity sample subsets corresponding to all training samples into an entity sample set Ent of the training set, calculating cosine similarity between entity vectors of any two entity samples in the entity sample set Ent, and when the Sim is larger than or equal to sigma and the entity classes of the two entity samples are the same, respectively adding the two entity samples into semantic neighbor sets of each other, wherein the sigma is a preset entity similarity threshold;
for example, any two entity samples Ent in the entity sample set Ent corresponding to the training set Tr are taken x 、ent y The semantic neighbor sets corresponding to the two entity samples are Ne x 、Ne y Calculating the entity vector entEmb of the two entity samples x 、entEmb y Cosine similarity Sim between two entity samples, when Sim is greater than or equal to sigma and entity Type of two entity samples x =Type y When it is, if ent x ∉Ne y Then will be ent x Addition of Ne y In if ent y ∉Ne x Then will be ent y Adding Ne x In (1).
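A minimal sketch of step 4.3, assuming the entity vectors are already stacked in a NumPy array; the name build_neighbor_sets is illustrative, not from the patent:

```python
# Sketch of step 4.3: build mutual semantic neighbor sets over entity
# vectors; sigma is the preset entity similarity threshold.
import numpy as np

def build_neighbor_sets(entity_vecs: np.ndarray, entity_types: list[str],
                        sigma: float) -> list[set[int]]:
    # L2-normalise so a dot product equals cosine similarity
    normed = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = len(entity_types)
    neighbors = [set() for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            if sim[x, y] >= sigma and entity_types[x] == entity_types[y]:
                neighbors[x].add(y)  # ent_y joins Ne_x
                neighbors[y].add(x)  # ent_x joins Ne_y
    return neighbors
```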
Step 4.4: traverse each training sample tr i If the entity sample ent j Semantic neighbor set Ne of j If not, then it is in ent j Randomly selecting an entity sample pair in the semantic neighbor set j Replacing to obtain an enhanced text sequence s i * And its corresponding tag sequence l i * Thereby obtaining a sentence pair sample pair to be evaluated i =<s i ,s i * >;
Step 4.5: obtaining a Sentence vector of each Sentence to be evaluated to the sample through a sequence-BERT model, wherein s i The sentence vector of is SenEmb i ,s i * The sentence vector of is SenEmb i * Calculate SenEmb i And SenEmb i * Cosine similarity between the text sequences is SimSem, when the SimSem is more than or equal to beta, the text sequence s is enhanced i * And corresponding theretoTag sequence l i * As an extended sample, constructing all extended samples into an extended sample set Exp, adding the extended sample set Exp into a training set Tr to obtain an enhanced set En, namely En = Tr U Exp; where β is a preset sentence similarity threshold.
Conventional text data enhancement schemes are usually based on Easy Data Augmentation (EDA), which generates extended samples similar to the original labeled samples through the noise-adding strategies of synonym replacement, random insertion, random swapping and random deletion. However, the key element extraction task for long network texts is strongly influenced by the contextual semantics of entities; applying EDA to it directly easily produces semantically incorrect extended sentences and thus introduces considerable noise, so entity-level and sentence-level semantic information should be preserved as much as possible during data enhancement.
In view of the above problems, in step four of this embodiment data enhancement is performed based on semantics. When enhancing long network text data, the text is encoded with the SyntaxBERT and Sentence-BERT pre-training language models, which encode the syntactic and semantic information of the text better than conventional pre-training language models. This efficiently encodes the syntactic and semantic information of the text and protects entity-level and sentence-level semantic feature information, improving the quality of the enhanced data; the approach thus minimizes the loss of entity-level and sentence-level semantic information during data enhancement, significantly reduces the cost of manual labeling, and significantly improves the extraction precision of the key element extraction model.
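A minimal sketch of the entity replacement and sentence-level filter of steps 4.4–4.5, using the sentence-transformers library as a stand-in for the Sentence-BERT encoder; the model name, the character-span representation of the entity, and the omission of tag-sequence realignment are simplifying assumptions:

```python
# Sketch of steps 4.4-4.5: replace an entity with a semantic neighbour,
# then keep the enhanced sentence only if it stays close to the original.
import random
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def augment(sentence: str, span: tuple[int, int], neighbors: list[str],
            beta: float) -> str | None:
    if not neighbors:
        return None
    start, end = span                      # character span of the entity
    enhanced = sentence[:start] + random.choice(neighbors) + sentence[end:]
    emb = sbert.encode([sentence, enhanced], convert_to_tensor=True)
    sim_sem = util.cos_sim(emb[0], emb[1]).item()
    return enhanced if sim_sem >= beta else None  # sentence-level filter
```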
Step five: training a plurality of key element extraction base models through an enhancement set En, and realizing parameter updating;
step six: preprocessing the verification set Va, acquiring a text sequence and a label sequence of each verification sample in the verification set Va, and respectively inputting the text sequence and the label sequence to each key element extraction base model M I Determining the optimal key element extraction base model M corresponding to each key element extraction base model I * (ii) a Specifically, the model parameter with the highest F1 value in each key element extraction base model in the whole training process is used as the model parameter of each corresponding optimal key element extraction base model, so that the optimal key element extraction base model M corresponding to each key element extraction base model is obtained I * This is because the F1 value is a harmonic mean value of the precision rate and the recall rate, and is a more comprehensive evaluation index while balancing and considering the precision rate and the recall rate.
Step seven: judging whether a training stopping criterion is met;
the training stopping criterion is that the performance of each optimal key element extraction base model on the verification set Va reaches a preset performance threshold value alpha, or the training set Tr reaches a preset data volume. The performance threshold alpha and the preset data amount are set according to actual needs.
If the training stopping criterion is met, executing a step ten, and if the training stopping criterion is not met, executing a step eight;
step eight: inputting a subset in the unlabeled sample set U into each optimal key element extraction base model to obtain a corresponding labeled value sample set; the method specifically comprises the following substeps:
step 8.1: extraction of base model M by optimal key elements 1 * 、M 2 * 、M 3 * For each unlabeled sample u in the subset US m Us of the text sequence m Predicting to obtain a text sequence us m The corresponding maximum probability label sequences are respectively cl m1 、cl m2 、cl m3 Adding the maximum probability label sequence corresponding to each unlabeled sample as a group of data to the weak probability label sequence corresponding to the sampleObtaining a weak label sequence set CL corresponding to the sample in the label sequence setm(ii) a Wherein the content of the first and second substances,m=1,2, … … Z, Z being the total number of unlabeled samples in the subset US;
step 8.2: calculating an unlabeled sample u according to a Dropout consistency score function m Uncertainty compared to the global modelDACS(u m ) (ii) a Specifically, calculating the unlabeled sample u by the expression 1) and the expression 2) m Uncertainty compared to the global modelDACS(u m ):
Figure 411891DEST_PATH_IMAGE001
1);
Figure 156993DEST_PATH_IMAGE002
2);
Wherein, the first and the second end of the pipe are connected with each other,Nas unlabeled sample u m The length of the sequence of (a) is,Jextracting the number of models of the base model for the optimal key elements,IandI’the model index of the two optimal key elements extraction base models M participating in the calculation,I=1,2,……JI’=1,2,……Jand is andII’
Figure 859370DEST_PATH_IMAGE003
extracting base model M for optimal key elements I * For character w k The prediction tag of (a) is determined,
Figure 689923DEST_PATH_IMAGE004
extracting base model M for optimal key elements I’ * For character w k Predictive label of, character w k As unlabeled sample u m Us of the text sequence m To (1)kA character of each position.
The larger the value of DACS(u_m), the larger the uncertainty of the unlabeled sample u_m with respect to the overall model, and the higher its labeling value.
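The exact formulas 1)–2) are rendered as images in the original; under the pairwise-disagreement reconstruction given above, a minimal sketch:

```python
# Sketch of expressions 1)-2): average, over characters, the fraction of
# base-model pairs whose predicted tags differ. preds[I][k] is model
# M_I*'s predicted tag for character w_k.
from itertools import combinations

def dacs(preds: list[list[str]]) -> float:
    J, N = len(preds), len(preds[0])
    pairs = list(combinations(range(J), 2))
    disagree = sum(preds[i][k] != preds[j][k]
                   for k in range(N) for i, j in pairs)
    return disagree / (N * len(pairs))
```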
Step 8.3: obtaining unlabeled sample u by a Sennce-BERT model m Sentence vector SenEmb m (ii) a Sentence vectors of all marked samples are obtained through a sequence-BERT model, and the Sentence vectors of all marked samples are clustered to obtain D clustering centers X d d=1,2, … … D; in this embodiment, the sentence vectors of all marked samples are clustered by the K-Means + + algorithm.
Step 8.4: calculating unlabeled sample u m Maximum semantic similarity to all cluster centersSimwt(u m ) (ii) a Specifically, calculating the unlabeled sample u by the expression 3) m Maximum semantic similarity to all cluster centersSimwt(u m ):
Figure 41007DEST_PATH_IMAGE005
3);
Wherein the content of the first and second substances,
Figure 375037DEST_PATH_IMAGE006
the function represents the Min-Max normalization function,CosineSimSenEmb m SenEmb d ) Representing a sentence vector SenEmb m And a clustering center X d Sentence vector SenEmb d Cosine similarity between them.
The larger the value of Simwt(u_m), the larger the probability that the unlabeled sample u_m belongs to a known class and the more representative the information it contains; conversely, the smaller the value, the larger the probability that the unlabeled sample u_m is an outlier and the less representative the information it contains.
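A sketch of steps 8.3–8.4 with scikit-learn's K-Means++; normalizing the best similarity across the current unlabeled subset is an assumption about where the Min-Max normalization is applied:

```python
# Sketch of steps 8.3-8.4: cluster labelled-sample sentence vectors with
# K-Means++ and score each unlabelled vector by its min-max-normalised
# best cosine similarity to any cluster centre.
import numpy as np
from sklearn.cluster import KMeans

def simwt_scores(unlabeled_vecs: np.ndarray, labeled_vecs: np.ndarray,
                 n_clusters: int) -> np.ndarray:
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10)
    km.fit(labeled_vecs)
    c = km.cluster_centers_
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    u = unlabeled_vecs / np.linalg.norm(unlabeled_vecs, axis=1, keepdims=True)
    best = (u @ c.T).max(axis=1)          # max cosine sim to any centre
    lo, hi = best.min(), best.max()       # min-max normalise over the subset
    return (best - lo) / (hi - lo) if hi > lo else np.ones_like(best)
```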
Step 8.5: binding unlabeled samples u m Calculating to obtain an unlabeled sample u m Information density ofInfo(u m ) (ii) a Specifically, calculating by expression 4) to obtain an unlabeled sample u m Information density ofInfo(u m ):
Figure 248315DEST_PATH_IMAGE007
4);
Wherein the content of the first and second substances,μand setting the regulation factors for preset regulation factors according to actual requirements.
If Info(u_m) ≥ θ, then for each character w_k, take the predicted tag that appears most often among the maximum probability label sequences in the weak label sequence set CL_m as the final label of the character w_k;

thereby obtaining the tag sequence ul_m corresponding to the text sequence us_m; the text sequence us_m and the tag sequence ul_m form a labeled value sample, and all labeled value samples form the labeled value sample set VS, which contains Z* labeled value samples, Z* ≤ Z.
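A sketch of step 8.5 under the combination assumed for expression 4); the weighting of the two indices and the helper name select_and_label are illustrative:

```python
# Sketch of step 8.5: combine uncertainty and representativeness into
# Info(u_m), keep samples above theta, and label each kept character by
# majority vote over the weak label sequences in CL_m.
from collections import Counter

def select_and_label(weak_labels: list[list[str]], dacs_score: float,
                     simwt_score: float, mu: float, theta: float):
    info = mu * dacs_score + (1 - mu) * simwt_score  # assumed combination
    if info < theta:
        return None                                   # not a value sample
    n = len(weak_labels[0])
    return [Counter(seq[k] for seq in weak_labels).most_common(1)[0][0]
            for k in range(n)]
```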
Many existing schemes reduce the labeling cost of training a key element extraction model through active learning, and common methods often screen value samples with a single uncertainty index while ignoring the distribution information among unlabeled samples. To exploit the advantages of different active learning sampling criteria on the key element extraction task, the Dropout consistency index (uncertainty) and the semantic similarity index (maximum semantic similarity) are fused, and value samples are selected with a combined query criterion. The criterion ensures that the selected samples are the most uncertain under the current overall model, while the semantic similarity accounts for the global information content of the samples, mitigating the negative influence of outlier samples on the overall model. Compared with conventional active learning sampling criteria, the combined query criterion characterizes the samples to be screened more comprehensively and helps select samples with richer information.
Furthermore, several optimal key element extraction base models predict the same labeled value sample to construct a weakly supervised learning scenario, and the predicted entities are inferred preferentially through entity-level multi-model voting, exploiting the advantages of different weakly supervised models; higher key element extraction precision can thus be obtained on top of several homogeneous base models, and the integrated model has better generalization and robustness than a single base model.
The method effectively integrates two strategies of active learning and weak supervised learning, saves the expensive cost of expert labeling data in the traditional active learning method, can screen out value samples with richer information content and accurately and efficiently label the value samples automatically, obviously reduces the manual labeling cost, and fully utilizes the unlabeled samples to improve the performance of extracting key elements.
Step nine: adding the marked value sample set VS obtained in the step eight into the training set Tr to form a new training set, so that the data volume in the training set can be increased, and removing the subset US in the step eight from the unmarked sample set U to form a new unmarked sample set; repeating the fourth step to the ninth step until the training stopping criterion is met;
step ten: and performing key element extraction on the test set Te through the optimal key element extraction base model.
Step 10.1: preprocessing the test set Te and obtaining a text sequence ts of each test sample q
Step 10.2: extracting base model M through each optimal key element after training 1 * 、M 2 * 、M 3 * Text sequence ts for each test sample separately q Predicting to obtain a text sequence ts q Corresponding maximum probability tag sequence tl q1 、tl q2 、tl q3 Forming the maximum probability label sequences of all test texts into a candidate label sequence set TLq
Step 10.3: definition a e For text sequences ts q ToeCharacters of individual positions in the candidate tag sequence set TLqFor each maximum probability tag sequence pair character a e The predicted label of (2) is the character a with the predicted label with the largest number of occurrences e The final label, resulting in the text sequence ts q Corresponding final tag sequence tl q And extracting results as key elements.
In this way, value samples with richer information can be screened out and labeled automatically, accurately and efficiently in low-resource scenarios, and a large number of high-quality labeled samples are generated without changing the syntactic structure and semantics of the original text as far as possible. This significantly reduces the cost of manual labeling, makes the fullest use of unlabeled samples, and effectively improves the performance of the key element extraction model, solving the problems of low quality of generated labeled samples, low element extraction precision, and the high cost of expert-labeled value samples in key element extraction methods based on conventional small-sample learning frameworks.
When monitoring public opinion on network platforms, the method can extract key public opinion information from long network texts promptly and accurately, can assist governments and enterprises in discovering potential risks and the focal points of network public opinion events, and promotes the modernization of public opinion monitoring systems and social governance capacity.
A key element extraction system of a long text is used for realizing the key element extraction method of the long text, and comprises a definition module 1, a preprocessing module 2, a model construction module 3, a data enhancement module 4, a model optimization module 5, a sample expansion module 6 and a model test module 7, as shown in FIG. 2;
the definition module 1: the system comprises a data acquisition module, a data processing module and a data processing module, wherein the data acquisition module is used for acquiring a network long text data set and dividing the network long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; the unlabeled sample set U is divided into a plurality of subsets.
The pretreatment module 2: the method is used for preprocessing the text data in the training set Tr, the verification set Va and the test set Te to obtain corresponding text sequences and label sequences.
The model construction module 3: the method is used for continuously pre-training the open source model and carrying out model migration to obtain a key element extraction base model M I
The data enhancement module 4: and the method is used for enhancing data of the training set Tr to obtain an enhanced set En.
The model optimization module 5: for extracting a base model M from each key element through an enhancement set En I Updating parameters; preprocessing a verification set Va to obtain a text sequence and a label sequence of each verification sample, and inputting the text sequence and the label sequence into each key element extraction base model M I In the method, the optimal key element extraction base model M with the highest F1 value in the verification set Va is obtained and stored I * Extracting key elements for the test set Te; until the current situation meets the training stop criterion.
The sample expansion module 6: if the current situation does not meet the training stopping criterion, inputting a subset US in the unlabeled sample set U into each optimal key element extraction base model M I * The method comprises the steps of predicting, screening unlabeled samples through improved value sample sampling and labeling criteria to obtain a labeled value sample set VS, adding the obtained labeled value sample set VS into a training set Tr to form a new training set, increasing the data volume in the training set, and removing a subset US from an unlabeled sample set U to form a new unlabeled sample set.
Model test module 7: inputting the text sequence of the test sample of the test set Te into the trained optimal key element extraction base model M I * And obtaining the maximum probability label sequence predicted by each model, and performing result integration on each maximum probability label sequence to obtain a key element extraction result.
A terminal device, comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the steps of the above-mentioned method for extracting key elements of long texts when executing the computer program.
Comparative example 1:
this comparative example differs from example 1 only in that the optimal DeBERTA-BiONLSTM-MHA-MCRF model (M) 1 * ) And extracting the key elements.
Comparative example 2:
this comparative example differs from example 1 only in that the optimal LEBERT-BiONLSTM-MHA-MCRF model (M) is used 2 * ) And extracting the key elements.
Comparative example 3:
this comparative example differs from example 1 only in that the optimal CogBERT-BiONLSTM-MHA-MCRF model (M) was used 3 * ) And extracting the key elements.
Comparative example 4:
this comparative example differs from example 1 in that the Ensemble model (fusion of M by the weakly supervised learning framework) was used 1 * 、M 2 * 、M 3 * The integration model) extracts the key elements.
Comparative example 5:
the present comparative example differs from example 1 in that key elements are extracted using the Ensemble (AL) model, AL (Active Learning) representing the use of an Active Learning strategy.
Comparative example 6:
the present comparative example differs from example 1 in that key elements are extracted using the Ensemble (DA) model, which represents the use of a Data enhancement strategy.
The experimental results of example 1 and the above comparative examples on the test set Te are shown in Table 1:
Table 1: (experimental results of example 1 and comparative examples 1–6; provided as an image in the original publication and not reproduced here)
As can be seen from the above table, compared with the other key element extraction models, the key element extraction model of the method achieves the best key element extraction result on the test set, a significant technical effect.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for extracting key elements of a long text is characterized by comprising the following steps:
the method comprises the following steps: acquiring a long text data set, and dividing the long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; dividing an unlabeled sample set U into a plurality of subsets;
step two: preprocessing a training set Tr to obtain a text sequence and a label sequence of each training sample in the training set Tr;
step three: continuously pre-training the open source model to obtain a plurality of pre-training language models, and performing model migration through the pre-training language models to obtain a plurality of key element extraction base models;
step four: combining a text sequence and a label sequence of each training sample in a training set Tr, and performing semantic-based data enhancement on the training set Tr to obtain an enhancement set En;
step five: updating parameters of a plurality of key element extraction base models through an enhancement set En;
step six: preprocessing the verification set Va, acquiring a text sequence and a label sequence of each verification sample in the verification set Va, inputting the text sequence and the label sequence into each key element extraction base model respectively, and determining an optimal key element extraction base model corresponding to each key element extraction base model;
step seven: judging whether the training stopping criterion is met; if the training stopping criterion is met, executing a step ten, and if the training stopping criterion is not met, executing a step eight;
step eight: inputting a subset in the unmarked sample set U into each optimal key element extraction base model to obtain a corresponding marked value sample set;
step nine: adding the marked value sample set obtained in the step eight into a training set Tr, and removing the subset in the step eight from an unmarked sample set U; repeating the fourth step to the ninth step until the training stopping criterion is met;
step ten: and extracting key elements from the test set Te through the optimal key element extraction base model.
2. The method for extracting key elements of long texts according to claim 1, wherein the third step comprises:
step 3.1: continuously pre-training the source model through a public opinion corpus to obtain a corresponding pre-training language model, wherein the pre-training language model comprises: a DeBERTA model, a LEBERT model, a CogBERT model, a Syntax-BERT model, and a Sennce-BERT model;
step 3.2: carrying out model migration on part of the pre-training language models, wherein the obtained key element extraction base model comprises the following steps: deBERta-BiONLSTM-MHA-MCRF model, LEBERT-BiONLSTM-MHA-MCRF model, and CogBERT-BiONLSTM-MHA-MCRF model.
3. The method for extracting key elements of long texts according to claim 2, wherein the fourth step comprises:
step 4.1: defining the i-th training sample tr_i in the training set Tr, whose text sequence is s_i and whose tag sequence is l_i; obtaining the semantic vector of each character in the text sequence s_i through the Syntax-BERT model, and extracting the entity samples of the training sample tr_i according to the text sequence s_i and the tag sequence l_i to form an entity sample subset Ent_i; the j-th entity sample in the entity sample subset is ent_j = <StrEnt_j, Type_j>, wherein StrEnt_j is the string representation of the entity sample ent_j, and Type_j is the entity category of the entity sample ent_j;
step 4.2: averaging the semantic vectors of the characters in StrEnt_j to obtain the entity vector entEmb_j corresponding to the entity sample ent_j;
step 4.3: forming the entity samples in the entity sample subsets corresponding to all training samples into the entity sample set Ent of the training set, and calculating the cosine similarity Sim between the entity vectors of any two entity samples in the entity sample set Ent; when Sim ≥ σ and the entity categories of the two entity samples are the same, adding the two entity samples into each other's semantic neighbor sets, wherein σ is a preset entity similarity threshold;
step 4.4: traversing each training sample tr_i; if the semantic neighbor set of an entity sample ent_j is not an empty set, selecting an entity sample from the semantic neighbor set of ent_j to replace ent_j, obtaining an enhanced text sequence s_i* and its corresponding tag sequence l_i*, and thus obtaining the sentence pair sample to be evaluated pair_i = <s_i, s_i*>;
step 4.5: obtaining the sentence vector of each sentence in the sentence pair sample to be evaluated through the Sentence-BERT model, wherein the sentence vector of s_i is SenEmb_i and the sentence vector of s_i* is SenEmb_i*; calculating the cosine similarity SimSem between SenEmb_i and SenEmb_i*; when SimSem ≥ β, taking the enhanced text sequence s_i* and its corresponding tag sequence l_i* as an extended sample, and adding all extended samples into the training set Tr to obtain the enhancement set En, wherein β is a preset sentence similarity threshold.
4. The method for extracting key elements from a long text according to claim 3, wherein in the sixth step, the model parameters with the highest F1 value achieved by each key element extraction base model over the whole training process are used as the model parameters of the corresponding optimal key element extraction base model.
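A compact sketch of this checkpoint rule, keeping the parameters with the highest validation F1 seen so far; seqeval is an assumed dependency and the state dictionary is a placeholder:

```python
# Keep the best-F1 checkpoint over the whole training run. seqeval is an
# assumed dependency; f1_score computes entity-level F1 over BIO sequences.
from seqeval.metrics import f1_score

best = {"f1": -1.0, "state": None}

def maybe_update_best(model_state, y_true, y_pred):
    f1 = f1_score(y_true, y_pred)      # y_*: lists of BIO tag sequences
    if f1 > best["f1"]:
        best["f1"], best["state"] = f1, model_state
    return best["f1"]

maybe_update_best({"epoch": 1},
                  [["B-LOC", "I-LOC", "O"]], [["B-LOC", "I-LOC", "O"]])
print(best["f1"])   # 1.0 for this toy prediction
```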
5. The method for extracting key elements from long texts as claimed in claim 4, wherein in the seventh step, the training stopping criterion is that the performance of each optimal key element extraction base model on the verification set Va has reached a preset performance threshold α, or the training set Tr has reached a preset data volume.
6. The method for extracting key elements from long texts according to claim 4, wherein the eighth step comprises:
step 8.1: predicting the text sequence us_m of each unlabeled sample u_m in the subset US through each optimal key element extraction base model to obtain the weak label sequence set corresponding to the sample, wherein m = 1, 2, …, Z, and Z is the total number of unlabeled samples in the subset;
step 8.2: calculating the uncertainty DACS(u_m) of the unlabeled sample u_m relative to the global model according to a Dropout consistency score function;
step 8.3: obtaining the sentence vector SenEmb_m of the unlabeled sample u_m and the sentence vectors of the labeled samples respectively through the Sentence-BERT model, and clustering the sentence vectors of all labeled samples to obtain D cluster centers X_d, d = 1, 2, …, D;
step 8.4: calculating the maximum semantic similarity Simwt(u_m) of the unlabeled sample u_m to all cluster centers;
step 8.5: combining the uncertainty and the maximum semantic similarity of the unlabeled sample u_m to calculate its information density Info(u_m); if Info(u_m) ≥ θ, taking the predicted label of the character w_k in each maximum-probability label sequence of the weak label sequence set corresponding to the sample as the final label of the character w_k, wherein the character w_k is the character at the k-th position of the text sequence us_m of the unlabeled sample u_m;
thereby obtaining the tag sequence ul_m corresponding to the text sequence us_m; the text sequence us_m and the tag sequence ul_m form a labeled value sample, and all labeled value samples form the labeled value sample set VS.
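Step 8.3 could be realized along the lines below, with sentence-transformers and scikit-learn as assumed dependencies; the encoder checkpoint is a placeholder, not the patent's specific Sentence-BERT model:

```python
# Step 8.3 sketch: cluster labeled-sample sentence vectors into D centers.
# sentence-transformers and scikit-learn are assumed dependencies; the
# checkpoint name is a placeholder.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
labeled_texts = ["长沙暴雨致橘子洲头积水", "市区多路段交通中断"]

emb = encoder.encode(labeled_texts)          # sentence vectors SenEmb
D = 2
kmeans = KMeans(n_clusters=D, n_init=10, random_state=0).fit(emb)
centers = kmeans.cluster_centers_            # the D cluster centers X_d
```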
7. The method for extracting key elements from long texts according to claim 6, wherein in step 8.2, the uncertainty DACS(u_m) of the unlabeled sample u_m relative to the global model is calculated through expression 1) and expression 2):

$$\mathrm{Agree}(M_I, M_{I'}) = \frac{1}{N} \sum_{k=1}^{N} \mathbb{1}\big[M_I^{*}(w_k) = M_{I'}^{*}(w_k)\big] \qquad 1)$$

$$DACS(u_m) = 1 - \frac{2}{J(J-1)} \sum_{I=1}^{J-1} \sum_{I'=I+1}^{J} \mathrm{Agree}(M_I, M_{I'}) \qquad 2)$$

wherein N is the sequence length of the unlabeled sample u_m; J is the number of optimal key element extraction base models; I and I' are the indexes of the two optimal key element extraction base models M participating in the calculation, with I = 1, 2, …, J, I' = 1, 2, …, J, and I ≠ I'; M_I^{*}(w_k) is the predicted label of the optimal key element extraction base model M_I for the character w_k, and M_{I'}^{*}(w_k) is the predicted label of the optimal key element extraction base model M_{I'} for the character w_k;
in step 8.4, the maximum semantic similarity Simwt(u_m) of the unlabeled sample u_m to all cluster centers is calculated through expression 3):

$$Simwt(u_m) = \max_{1 \le d \le D} \mathrm{MinMax}\big(\mathrm{CosineSim}(SenEmb_m, SenEmb_d)\big) \qquad 3)$$

wherein MinMax(·) denotes the Min-Max normalization function, and CosineSim(SenEmb_m, SenEmb_d) denotes the cosine similarity between the sentence vector SenEmb_m and the sentence vector SenEmb_d of the cluster center X_d;
in step 8.5, the information density Info(u_m) of the unlabeled sample u_m is calculated through expression 4):

$$Info(u_m) = \mu \cdot DACS(u_m) + (1 - \mu) \cdot Simwt(u_m) \qquad 4)$$

wherein μ is a preset regulating factor.
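The scoring of expressions 1) through 4) can be sketched as below; because the granted publication renders these formulas as images, the forms here follow the reconstruction above and should be read as one plausible interpretation rather than the authoritative equations:

```python
# Illustrative scoring of expressions 1)-4). The formulas follow the
# reconstruction above, which is an assumption about the image-rendered
# equations in the granted publication.
import numpy as np
from itertools import combinations

def dacs(pred_label_seqs):
    """Expressions 1)-2): pairwise disagreement of the J base models'
    predicted label sequences for one sample (higher = more uncertain)."""
    n = len(pred_label_seqs[0])
    agree = [sum(a[k] == b[k] for k in range(n)) / n
             for a, b in combinations(pred_label_seqs, 2)]
    return 1.0 - sum(agree) / len(agree)

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def simwt(sen_embs, centers):
    """Expression 3): each sample's max cosine similarity to the cluster
    centers, Min-Max normalized across the batch (normalization scope is
    an assumption)."""
    raw = [max(float(e @ c / (np.linalg.norm(e) * np.linalg.norm(c)))
               for c in centers) for e in sen_embs]
    return min_max(raw)

def info(dacs_val, simwt_val, mu=0.5):
    """Expression 4): weighted combination with regulating factor mu."""
    return mu * dacs_val + (1 - mu) * simwt_val

preds = [["B-LOC", "I-LOC", "O"], ["B-LOC", "O", "O"], ["B-LOC", "I-LOC", "O"]]
embs = np.array([[1.0, 0.2, 0.1], [0.1, 0.9, 0.3]])
print(info(dacs(preds), simwt(embs, np.eye(3))[0]))
```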
8. The method for extracting key elements from long texts according to claim 7, wherein step ten comprises:
step 10.1: preprocessing the test set Te and obtaining the text sequence ts_q of each test sample;
step 10.2: predicting the text sequence ts_q of each test sample through each trained optimal key element extraction base model to obtain the maximum-probability label sequence corresponding to the text sequence ts_q, and forming the maximum-probability label sequences of all test texts into a candidate label sequence set;
step 10.3: defining a_e as the character at the e-th position of the text sequence ts_q; taking the predicted label with the largest number of occurrences, among the predicted labels of the character a_e in each maximum-probability label sequence of the candidate label sequence set, as the final label of the character a_e, thereby obtaining the final tag sequence tl_q corresponding to the text sequence ts_q as the key element extraction result.
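Step 10.3 is a per-character majority vote over the candidate label sequences; a minimal sketch, assuming the J base models return equal-length label sequences:

```python
# Per-character majority vote over the candidate label sequences produced
# by the trained base models for one test text.
from collections import Counter

def vote_final_labels(candidate_seqs):
    """candidate_seqs: J label sequences of equal length for one test text."""
    length = len(candidate_seqs[0])
    return [Counter(seq[e] for seq in candidate_seqs).most_common(1)[0][0]
            for e in range(length)]

cands = [["B-LOC", "I-LOC", "O"],
         ["B-LOC", "O",     "O"],
         ["B-LOC", "I-LOC", "O"]]
print(vote_final_labels(cands))   # ['B-LOC', 'I-LOC', 'O']
```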
9. A long text key element extraction system for implementing the long text key element extraction method of any one of claims 1~8, comprising a definition module (1), a preprocessing module (2), a model construction module (3), a data enhancement module (4), a model optimization module (5), a sample expansion module (6) and a model test module (7);
definition module (1): used for obtaining the long text data set;
preprocessing module (2): used for preprocessing the text data to obtain the text sequence and the label sequence;
model construction module (3): used for performing continuous pre-training and model migration on the open source model to obtain the key element extraction base models;
data enhancement module (4): used for performing data enhancement on the training set Tr to obtain the enhancement set En;
model optimization module (5): used for updating the parameters of the key element extraction base models through the enhancement set;
sample expansion module (6): used for generating the labeled value sample set and adding it into the training set Tr;
model test module (7): used for extracting key elements from the test set Te through the optimal key element extraction base model.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for extracting key elements of a long text according to any one of claims 1~8 when executing the computer program.
CN202211592205.4A 2022-12-13 2022-12-13 Method, system and terminal device for extracting key elements of long text Active CN115600602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592205.4A CN115600602B (en) 2022-12-13 2022-12-13 Method, system and terminal device for extracting key elements of long text

Publications (2)

Publication Number Publication Date
CN115600602A CN115600602A (en) 2023-01-13
CN115600602B true CN115600602B (en) 2023-02-28

Family

ID=84851846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592205.4A Active CN115600602B (en) 2022-12-13 2022-12-13 Method, system and terminal device for extracting key elements of long text


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245197B (en) * 2023-02-21 2023-11-07 北京数美时代科技有限公司 Method, system, medium and equipment for improving training rate of language model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506696A (en) * 2020-03-03 2020-08-07 平安科技(深圳)有限公司 Information extraction method and device based on small number of training samples
CN111639163A (en) * 2020-04-29 2020-09-08 深圳壹账通智能科技有限公司 Problem generation model training method, problem generation method and related equipment
US20220019742A1 (en) * 2020-07-20 2022-01-20 International Business Machines Corporation Situational awareness by fusing multi-modal data with semantic model
CN114648016A (en) * 2022-03-29 2022-06-21 河海大学 Event argument extraction method based on event element interaction and tag semantic enhancement
CN114741473B (en) * 2022-04-17 2023-04-18 中国人民解放军国防科技大学 Event extraction method based on multi-task learning
CN115510180A (en) * 2022-09-30 2022-12-23 中国电子科技集团公司第十研究所 Multi-field-oriented complex event element extraction method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant