CN115600602B - Method, system and terminal device for extracting key elements of long text - Google Patents

Method, system and terminal device for extracting key elements of long text

Info

Publication number
CN115600602B
CN115600602B (application CN202211592205.4A)
Authority
CN
China
Prior art keywords
model
sample
training
sequence
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211592205.4A
Other languages
Chinese (zh)
Other versions
CN115600602A (en)
Inventor
李芳芳
曾咏哲
胡世雄
罗垲炜
甘甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202211592205.4A
Publication of CN115600602A
Application granted
Publication of CN115600602B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a system and a terminal device for extracting key elements from long texts. The extraction method comprises: dividing a labeled sample set into a training set, a verification set and a test set; performing semantic-based data enhancement on the training set to obtain an enhanced set; updating the parameters of a plurality of key element extraction base models with the enhanced set; expanding the training set by converting unlabeled samples into labeled value sample sets and then training cyclically; and finally determining the optimal key element extraction base models. The key element extraction system comprises a data enhancement module, a model optimization module, a sample expansion module and other modules; the terminal device comprises a memory, a processor and a computer program stored in the memory and executable on the processor. The method addresses the low quality of generated labeled samples and the low element extraction precision of key element extraction methods based on conventional small-sample learning frameworks.

Description

Method, system and terminal device for extracting key elements of long text
Technical Field
The invention relates to the technical field of text information extraction, in particular to a method, a system and a terminal device for extracting key elements of a long text.
Background
Named entity recognition aims at extracting entities with specific meanings, together with their classes, such as person names, place names and organization names, from large amounts of text data. It is one of the important subtasks of natural language processing and, as a key technology, provides underlying support for tasks such as intelligent question answering, knowledge graphs and syntactic analysis.
Named entity recognition technology has developed continuously, from early methods based on statistics and manually defined rules, to methods based on feature engineering and machine learning, and then to the deep learning methods popular in recent years, with markedly improved recognition performance. With the rapid growth of network information resources and the continuous development of intelligent information extraction technology, general-purpose named entity recognition can hardly meet the requirements of domain-specific named entity recognition. The task of extracting key elements from long network texts focuses on named entities related to public opinion services, and the extraction process faces the following problems:
(1) In the prior art, key element extraction algorithms for long network texts adopt deep learning, whose extraction performance depends to a great extent on the scale and quality of the labeled corpus. Because labeled long network text corpora are scarce, sufficient data for training a model is often hard to acquire, so the model cannot capture enough data patterns.
In existing research, small-sample learning is usually used to extract key elements so as to cope with the small amount of training data. However, conventional small-sample learning schemes usually adopt a single strategy, such as data enhancement, transfer learning, active learning or weakly supervised learning, each with its own drawbacks: 1) key element extraction methods based on active learning generally use a single sampling criterion, considering only one index such as uncertainty or diversity, and still require experts to participate in data annotation; 2) key element extraction methods based on simple data enhancement tend to ignore high-order feature information at the syntactic and semantic levels of the text when expanding data, and cannot model the characteristics of the data comprehensively, so the quality of the training samples is low and more noise is introduced into the key element extraction model; 3) the prior art does not fully combine the knowledge of existing models with multiple strategies to exploit the massive unlabeled long network text data, so the cost of training a model remains high.
(2) Unlike named entity recognition in the general domain, the key element extraction task for long network texts mainly focuses on named entities closely related to the public opinion domain, aiming to extract the boundaries and categories of public opinion named entities from unstructured long network texts. However, fine-grained public opinion named entities have many categories and strong domain characteristics; for example, the "event-related entity" in the public opinion domain may be divided into elements with public opinion domain properties, such as "event-related person", "event-related media" and "event-related company", whose classification is closely related to the context. Existing methods for extracting key elements from long network texts have difficulty handling the long-distance dependencies among public opinion named entities and easily misclassify entities.
For the technical problems of scarce high-quality text data and low extraction precision in extracting key elements from long network texts, no effective solution has yet been proposed. Therefore, how to construct a training sample expansion scheme from a small number of labeled samples and a large number of unlabeled samples, and how to construct a key element extraction model for long network texts that yields high-precision extraction results, have become urgent problems.
In summary, a method, a system and a terminal device for extracting key elements of long texts are urgently needed to solve the problems in the prior art.
Disclosure of Invention
The invention aims to provide a method, a system and a terminal device for extracting key elements from long texts, so as to improve the precision of key element extraction.
In order to achieve the purpose, the invention provides a method for extracting key elements of a long text, which comprises the following steps:
the method comprises the following steps: acquiring a long text data set, and dividing the long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; dividing an unlabeled sample set U into a plurality of subsets;
step two: preprocessing a training set Tr to obtain a text sequence and a label sequence of each training sample in the training set Tr;
step three: continuously pre-training the open source model to obtain a plurality of pre-training language models, and performing model migration through the pre-training language models to obtain a plurality of key element extraction base models;
step four: combining a text sequence and a label sequence of each training sample in the training set Tr, and performing semantic-based data enhancement on the training set Tr to obtain an enhanced set En;
step five: updating parameters of the extracted basic models of the key elements through an enhancement set En;
step six: preprocessing the verification set Va, acquiring a text sequence and a label sequence of each verification sample in the verification set Va, inputting the text sequence and the label sequence into each key element extraction base model respectively, and determining an optimal key element extraction base model corresponding to each key element extraction base model;
step seven: judging whether a training stopping criterion is met; if the training stopping criterion is met, executing the step ten, and if the training stopping criterion is not met, executing the step eight;
step eight: inputting a subset in the unlabeled sample set U into each optimal key element extraction base model to obtain a corresponding labeled value sample set;
step nine: adding the value-labeled sample set obtained in the step eight into a training set Tr, and removing the subset obtained in the step eight from an unlabeled sample set U; repeating the fourth step to the ninth step until the training stopping criterion is met;
step ten: and performing key element extraction on the test set Te through the optimal key element extraction base model.
Preferably, the third step includes:
step 3.1: continuously pre-training the source model through a public opinion corpus to obtain a corresponding pre-training language model, wherein the pre-training language model comprises: a DeBERta model, a LEBERT model, a CogBERT model, a Syntax model, and a Senntence-BERT model;
step 3.2: carrying out model migration on part of the pre-training language models, wherein the obtained key element extraction base model comprises the following steps: the DeBERTA-BiONLSTM-MHA-MCRF model, the LEBERT-BiONLSTM-MHA-MCRF model, and the CogBERT-BiONLSTM-MHA-MCRF model.
Preferably, the fourth step includes:
step 4.1: define the th in the training set TriA training sample tr i Is s i And the tag sequence is l i Obtaining a text sequence s by a SyntaxBERT model i Semantic vector of each character in the text sequence s i A tag sequence l i Extracting training samples tr i Form a subset of entity samples Ent i First of a subset of the entity samplesjIndividual entity sample ent j =<StrEnt j ,Type j >Wherein StrEnt j As entity sample ent j Is a string representation of, type j Is an entity sample ent j The entity class of (1);
step 4.2: for StrEnt j The semantic vector of the character is averaged to obtain an entity sample ent j The corresponding entity vector entEmb j
Step 4.3: forming entity samples in entity sample subsets corresponding to all training samples into an entity sample set Ent of the training set, calculating cosine similarity between entity vectors of any two entity samples in the entity sample set Ent, and when the Sim is larger than or equal to sigma and the entity classes of the two entity samples are the same, respectively adding the two entity samples into semantic neighbor sets of each other, wherein the sigma is a preset entity similarity threshold;
step 4.4: traverse each training sample tr i If the entity sample ent j Is not an empty set, then at ent j Semantic neighbor set ofIn which one entity sample pair is selected j Replacing to obtain an enhanced text sequence s i * And its corresponding tag sequence l i * Thus obtaining the sentence pair sample pair to be evaluated i =<s i ,s i * >;
Step 4.5: obtaining a Sentence vector of each Sentence to be evaluated to the sample through a sequence-BERT model, wherein s i The sentence vector of is SenEmb i ,s i * The sentence vector of is SenEmb i * Calculate SenEmb i And SenEmb i * Cosine similarity between the text sequences is SimSem, when the SimSem is more than or equal to beta, the text sequence s is enhanced i * And its corresponding tag sequence l i * As an extended sample, adding all the extended samples into a training set Tr to obtain an enhanced set En; where β is a preset sentence similarity threshold.
Preferably, in the sixth step, the model parameter with the highest F1 value in each key element extraction base model in the whole training process is used as the model parameter of each corresponding optimal key element extraction base model.
Preferably, in the seventh step, the training stopping criterion is that the performance of each optimal key element extraction base model on the verification set Va has reached a preset performance threshold α, or the training set Tr has reached a preset data volume.
Preferably, the step eight includes:
step 8.1: extracting each unlabeled sample u in the base model pair subset US through the optimal key elements m Us of the text sequence m Predicting to obtain a weak label sequence set corresponding to the sample; wherein the content of the first and second substances,m=1,2, … … Z, Z is the total number of unlabeled samples in the subset;
step 8.2: calculating an unlabeled sample u according to a Dropout consistency score function m Uncertainty compared to the global modelDACS(u m );
Step 8.3: respectively obtaining unlabeled samples u through a sequence-BERT model m Sentence vector SenEmb m And the sentence vectors of the marked samples, and clustering the sentence vectors of all the marked samples to obtain D clustering centers X d d=1,2,……D;
Step 8.4: calculating unlabeled sample u m Maximum semantic similarity to all cluster centersSimwt(u m );
Step 8.5: binding unlabeled samples u m Calculating to obtain an unlabeled sample u m Information density ofInfo(u m ) (ii) a If it isInfo(u m ) If the value is more than or equal to theta, in the weak label sequence set corresponding to the sample, for each maximum probability label sequence pair character w k The predictive tag of (2) is used as the character w k Final label, in which the character w k As unlabeled sample u m Us of a text sequence m To (1)kA character of each position;
thereby obtaining a text sequence us m Corresponding tag sequence ul m Will text sequence us m And the tag sequence ul m And forming the marked value samples, wherein all the marked value samples form a marked value sample set VS.
Preferably, in the step 8.2, the uncertainty DACS(u_m) of the unlabeled sample u_m compared to the overall model is calculated by expressions 1) and 2):

DACS(u_m) = (1/N) · Σ_{k=1}^{N} Dis(w_k)   1);

Dis(w_k) = (2/(J·(J−1))) · Σ_{I=1}^{J−1} Σ_{I'=I+1}^{J} 1[ŷ_k^{(I)} ≠ ŷ_k^{(I')}]   2);

where N is the sequence length of the unlabeled sample u_m; J is the number of optimal key element extraction base models; I and I' are the model indices of the two optimal key element extraction base models M participating in the calculation, I = 1, 2, ..., J, I' = 1, 2, ..., J, and I ≠ I'; ŷ_k^{(I)} is the predicted tag of the optimal key element extraction base model M_I* for the character w_k; ŷ_k^{(I')} is the predicted tag of the optimal key element extraction base model M_I'* for the character w_k; and 1[·] is the indicator function;

in the step 8.4, the maximum semantic similarity Simwt(u_m) between the unlabeled sample u_m and all cluster centers is calculated by expression 3):

Simwt(u_m) = max_{d=1,...,D} MinMax(CosineSim(SenEmb_m, SenEmb_d))   3);

where MinMax(·) denotes the Min-Max normalization function, and CosineSim(SenEmb_m, SenEmb_d) denotes the cosine similarity between the sentence vector SenEmb_m and the sentence vector SenEmb_d of the cluster center X_d;

in the step 8.5, the information density Info(u_m) of the unlabeled sample u_m is calculated by expression 4):

Info(u_m) = μ · DACS(u_m) + (1 − μ) · Simwt(u_m)   4);

where μ is a preset regulating factor.
Preferably, the step ten includes:
step 10.1: preprocessing the test set Te and obtaining a text sequence ts of each test sample q
Step 10.2: respectively extracting the text sequence ts of the base model for each test sample through each optimal key element after training q Predicting to obtain a text sequence ts q Corresponding maximum probability label sequences, and forming a candidate label sequence set by the maximum probability label sequences of all test texts;
step 10.3: definition a e For text sequences ts q To (1)eThe characters of each position in the candidate label sequence set are corresponding to the character a of each maximum probability label sequence e The predicted label of (2) is the character a with the predicted label with the largest number of occurrences e The final label, resulting in the text sequence ts q Corresponding final tag sequence tl q And extracting results as key elements.
The invention also provides a system for extracting the key elements of the long text, which is used for realizing the method for extracting the key elements of the long text and comprises a definition module, a preprocessing module, a model construction module, a data enhancement module, a model optimization module, a sample expansion module and a model test module;
a definition module: for obtaining a long text data set;
a preprocessing module: used for preprocessing text data to obtain text sequences and tag sequences;

a model construction module: used for continuously pre-training open source models and performing model migration to obtain the key element extraction base models;

a data enhancement module: used for performing data enhancement on the training set Tr to obtain the enhanced set En;

a model optimization module: used for updating the parameters of the key element extraction base models through the enhanced set;

a sample expansion module: used for generating the labeled value sample set and adding it to the training set Tr;

a model testing module: used for extracting key elements from the test set Te through the optimal key element extraction base models.
The invention also provides a terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the method for extracting the key elements of the long text.
The technical scheme of the invention has the following beneficial effects:
(1) By using a key element extraction method that integrates data enhancement, transfer learning, active learning and weakly supervised learning, the invention can, in low-resource scenarios, screen out value samples with richer information and label them automatically, accurately and efficiently, while generating a large number of high-quality labeled samples without changing the syntactic structure and semantics of the original text as far as possible. This significantly reduces the cost of manual labeling, makes the fullest use of unlabeled samples, and effectively improves the performance of the key element extraction model, solving the problems of low quality of generated labeled samples, low element extraction precision, and the high cost of expert-labeled value samples in key element extraction methods based on conventional small-sample learning frameworks.
(2) The invention continuously pre-trains open source models on a large-scale public opinion corpus and performs parameter-based transfer learning on them: the pre-training language models transfer their weight parameters to the shared parameters of the key element extraction base models, constructing several base models that can fully mine long-distance dependent semantic features in text. With little data available in the long network text domain, existing knowledge initializes model performance to the greatest extent; the models can subsequently be fine-tuned with labeled long network text data from the target domain, improving performance on the long network text key element extraction task while saving computing resources and cost.
(3) The invention performs data enhancement based on semantics. When enhancing long network text data, the text is encoded with the SyntaxBERT and Sentence-BERT pre-training language models, which encode the syntactic and semantic information of the text better than conventional pre-training language models. During enhancement, the original entity is first replaced by a target entity that belongs to the same entity type and is semantically similar to it, yielding an enhanced sentence after entity replacement; enhanced sentences that are semantically similar to the original sentence are then retained together with their tag sequences. This efficiently encodes the syntactic and semantic information of the text and protects entity-level and sentence-level semantic feature information, so the invention minimizes the loss of entity-level and sentence-level semantic information during data enhancement, improves the quality of the enhanced data, significantly reduces the cost of manual labeling, and significantly improves the extraction precision of the key element extraction model.
(4) In the invention, the model parameters with the highest F1 value of each key element extraction base model during the whole training process are taken as the model parameters of each corresponding optimal key element extraction base model, because the F1 value is the harmonic mean of precision and recall and balances both at once, making it a more comprehensive evaluation index.
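For reference, the F1 value is the standard harmonic mean of precision P and recall R:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$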
(5) In the invention, value samples are selected by combining a Dropout consistency index (i.e., uncertainty) with a semantic similarity index (i.e., maximum semantic similarity) into a combined query criterion. The criterion ensures that the selected samples are the most uncertain under the current overall model, while the semantic similarity accounts for the global information content of the samples, mitigating the negative influence of outlier samples on the overall model. Compared with conventional active learning sampling criteria, the combined query criterion characterizes the samples to be screened more comprehensively and helps select samples with richer information. Several optimal key element extraction base models predict the same value sample to construct a weakly supervised learning scenario, and the predicted entities are inferred preferentially through entity-level multi-model voting, exploiting the advantages of different weakly supervised models; higher key element extraction precision is thus obtained on top of several homogeneous base models, and the integrated model has better generalization and robustness than a single base model.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the invention. In the drawings:
fig. 1 is a flowchart of a method for extracting key elements from a long text in embodiment 1 of the present application;
fig. 2 is a schematic structural diagram of a long text key element extraction system according to embodiment 1 of the present application;
the system comprises a definition module 1, a preprocessing module 2, a model construction module 3, a data enhancement module 4, a model optimization module 5, a sample expansion module 6 and a model testing module 7.
Detailed Description
Embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways, which are defined and covered by the claims.
Example 1:
referring to fig. 1 to 2, the present embodiment is applied to key element extraction of a long text on a network, and is convenient for timely and accurately extracting key public opinion information from the long text on the network.
A method for extracting key elements of a long text, referring to FIG. 1 (S1-S10 in the figure represent steps one to ten), comprises the following steps:
the method comprises the following steps: acquiring a network long text data set on a network platform, and dividing the long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; dividing an unlabeled sample set U into a plurality of subsets;
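As a minimal sketch of this split, assuming illustrative 70/15/15 ratios and a fixed subset size, neither of which is specified in the patent:

```python
# Sketch of step one: split the labeled set into Tr/Va/Te and cut the
# unlabeled set into fixed-size subsets. Ratios and subset size are
# illustrative assumptions, not values from the patent.
from sklearn.model_selection import train_test_split

def split_data(labeled: list, unlabeled: list, subset_size: int):
    tr, rest = train_test_split(labeled, test_size=0.3, random_state=42)
    va, te = train_test_split(rest, test_size=0.5, random_state=42)
    subsets = [unlabeled[i:i + subset_size]
               for i in range(0, len(unlabeled), subset_size)]
    return tr, va, te, subsets
```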
step two: preprocessing a training set Tr to obtain a text sequence and a label sequence of each training sample in the training set Tr;
step three: continuously pre-training the open source model to obtain a plurality of pre-training language models, and carrying out model migration through the pre-training language models to obtain a plurality of key element extraction base models M I I=1,2,……JJThe method specifically comprises the following substeps of extracting the number of models of the base model for the key elements:
step 3.1: collecting about 100 pieces of Internet public opinion texts from a 'people data' platform to complete the construction of a public opinion corpus; continuing to pre-train the sourcing model through a public opinion corpus to obtain a corresponding pre-trained language model, wherein the pre-trained language model comprises: a DeBERta model, a LEBERT model, a CogBERT model, a Syntax model, and a Senntence-BERT model;
step 3.2: model migration is carried out on part of the pre-training language model through a parameter-based migration learning method, and a corresponding key element extraction base model M is obtained by combining a Bi-directional Ordered neural Long Short Term Memory network (Bi-directional Ordered nerves Long Short-Term Memory, biONLSTM), a Multi-Head Attention Mechanism (MHA) and a mask added Conditional Random Field (MCRF) I In this embodiment, three key element extraction basis models are obtained: deBERTA-BiONLSTM-MHA-MCRF model (defined as M) 1 ) The LEBERT-BiONLSTM-MHA-MCRF model (defined as M) 2 ) And the CogBERT-BiONLSTM-MHA-MCRF model (defined as M) 3 )。
The open source models are continuously pre-trained on a large-scale public opinion corpus and given parameter-based transfer learning: the pre-training language models transfer their weight parameters to the shared parameters of the key element extraction base models, constructing several base models that can fully mine long-distance dependent semantic features in text. With little data available in the long network text domain, existing knowledge initializes model performance to the greatest extent; the models can subsequently be fine-tuned with labeled long network text data from the target domain, helping to improve performance on the long network text key element extraction task and saving computing resources and cost.
Step four: combining a text sequence and a label sequence of each training sample in a training set Tr, and performing semantic-based data enhancement on the training set Tr to obtain an enhancement set En; the method specifically comprises the following substeps:
step 4.1: define the second in the training set TriA training sample tr i Is s i The sequence of the tag is l i Obtaining a text sequence s by a SyntaxBERT model i Semantic vector of each character in the text sequence s i A tag sequence l i Extracting training samples tr i Form a subset of entity samples Ent i First of a subset of the entity samplesjIndividual entity sample ent j =<StrEnt j ,Type j >Wherein StrEnt j Is an entity sample ent j Type of j Is an entity sample ent j The entity class of (1);
step 4.2: for entity sample subset Ent i Each entity sample ent in j For constituting character string, strEnt is expressed j The semantic vector of the character is averaged to obtain the entity sample ent j The corresponding entity vector entEmb j
Step 4.3: forming entity samples in entity sample subsets corresponding to all training samples into an entity sample set Ent of the training set, calculating cosine similarity between entity vectors of any two entity samples in the entity sample set Ent, and when the Sim is larger than or equal to sigma and the entity classes of the two entity samples are the same, respectively adding the two entity samples into semantic neighbor sets of each other, wherein the sigma is a preset entity similarity threshold;
for example, any two entity samples Ent in the entity sample set Ent corresponding to the training set Tr are taken x 、ent y The semantic neighbor sets corresponding to the two entity samples are Ne x 、Ne y Calculating the entity vector entEmb of the two entity samples x 、entEmb y Cosine similarity Sim between two entity samples, when Sim is greater than or equal to sigma and entity Type of two entity samples x =Type y When it is, if ent x ∉Ne y Then will be ent x Addition of Ne y In if ent y ∉Ne x Then will be ent y Adding Ne x In (1).
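A minimal sketch of step 4.3, assuming the entity vectors are already stacked in a NumPy array; the name build_neighbor_sets is illustrative, not from the patent:

```python
# Sketch of step 4.3: build mutual semantic neighbor sets over entity
# vectors; sigma is the preset entity similarity threshold.
import numpy as np

def build_neighbor_sets(entity_vecs: np.ndarray, entity_types: list[str],
                        sigma: float) -> list[set[int]]:
    # L2-normalise so a dot product equals cosine similarity
    normed = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = len(entity_types)
    neighbors = [set() for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            if sim[x, y] >= sigma and entity_types[x] == entity_types[y]:
                neighbors[x].add(y)  # ent_y joins Ne_x
                neighbors[y].add(x)  # ent_x joins Ne_y
    return neighbors
```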
Step 4.4: traverse each training sample tr i If the entity sample ent j Semantic neighbor set Ne of j If not, then it is in ent j Randomly selecting an entity sample pair in the semantic neighbor set j Replacing to obtain an enhanced text sequence s i * And its corresponding tag sequence l i * Thereby obtaining a sentence pair sample pair to be evaluated i =<s i ,s i * >;
Step 4.5: obtaining a Sentence vector of each Sentence to be evaluated to the sample through a sequence-BERT model, wherein s i The sentence vector of is SenEmb i ,s i * The sentence vector of is SenEmb i * Calculate SenEmb i And SenEmb i * Cosine similarity between the text sequences is SimSem, when the SimSem is more than or equal to beta, the text sequence s is enhanced i * And corresponding theretoTag sequence l i * As an extended sample, constructing all extended samples into an extended sample set Exp, adding the extended sample set Exp into a training set Tr to obtain an enhanced set En, namely En = Tr U Exp; where β is a preset sentence similarity threshold.
Conventional text data enhancement schemes are usually based on Easy Data Augmentation (EDA), which generates extended samples similar to the original labeled samples through the noise-adding strategies of synonym replacement, random insertion, random swapping and random deletion. However, the key element extraction task for long network texts is strongly influenced by the contextual semantics of entities; applying EDA to it directly easily produces semantically incorrect extended sentences and thus introduces considerable noise, so entity-level and sentence-level semantic information should be preserved as much as possible during data enhancement.
In view of the above problems, in step four of this embodiment data enhancement is performed based on semantics. When enhancing long network text data, the text is encoded with the SyntaxBERT and Sentence-BERT pre-training language models, which encode the syntactic and semantic information of the text better than conventional pre-training language models. This efficiently encodes the syntactic and semantic information of the text and protects entity-level and sentence-level semantic feature information, improving the quality of the enhanced data; the approach thus minimizes the loss of entity-level and sentence-level semantic information during data enhancement, significantly reduces the cost of manual labeling, and significantly improves the extraction precision of the key element extraction model.
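A minimal sketch of the entity replacement and sentence-level filter of steps 4.4–4.5, using the sentence-transformers library as a stand-in for the Sentence-BERT encoder; the model name, the character-span representation of the entity, and the omission of tag-sequence realignment are simplifying assumptions:

```python
# Sketch of steps 4.4-4.5: replace an entity with a semantic neighbour,
# then keep the enhanced sentence only if it stays close to the original.
import random
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def augment(sentence: str, span: tuple[int, int], neighbors: list[str],
            beta: float) -> str | None:
    if not neighbors:
        return None
    start, end = span                      # character span of the entity
    enhanced = sentence[:start] + random.choice(neighbors) + sentence[end:]
    emb = sbert.encode([sentence, enhanced], convert_to_tensor=True)
    sim_sem = util.cos_sim(emb[0], emb[1]).item()
    return enhanced if sim_sem >= beta else None  # sentence-level filter
```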
Step five: training a plurality of key element extraction base models through an enhancement set En, and realizing parameter updating;
step six: preprocessing the verification set Va, acquiring a text sequence and a label sequence of each verification sample in the verification set Va, and respectively inputting the text sequence and the label sequence to each key element extraction base model M I Determining the optimal key element extraction base model M corresponding to each key element extraction base model I * (ii) a Specifically, the model parameter with the highest F1 value in each key element extraction base model in the whole training process is used as the model parameter of each corresponding optimal key element extraction base model, so that the optimal key element extraction base model M corresponding to each key element extraction base model is obtained I * This is because the F1 value is a harmonic mean value of the precision rate and the recall rate, and is a more comprehensive evaluation index while balancing and considering the precision rate and the recall rate.
Step seven: judging whether a training stopping criterion is met;
the training stopping criterion is that the performance of each optimal key element extraction base model on the verification set Va reaches a preset performance threshold value alpha, or the training set Tr reaches a preset data volume. The performance threshold alpha and the preset data amount are set according to actual needs.
If the training stopping criterion is met, executing a step ten, and if the training stopping criterion is not met, executing a step eight;
step eight: inputting a subset in the unlabeled sample set U into each optimal key element extraction base model to obtain a corresponding labeled value sample set; the method specifically comprises the following substeps:
step 8.1: extraction of base model M by optimal key elements 1 * 、M 2 * 、M 3 * For each unlabeled sample u in the subset US m Us of the text sequence m Predicting to obtain a text sequence us m The corresponding maximum probability label sequences are respectively cl m1 、cl m2 、cl m3 Adding the maximum probability label sequence corresponding to each unlabeled sample as a group of data to the weak probability label sequence corresponding to the sampleObtaining a weak label sequence set CL corresponding to the sample in the label sequence setm(ii) a Wherein the content of the first and second substances,m=1,2, … … Z, Z being the total number of unlabeled samples in the subset US;
step 8.2: calculating an unlabeled sample u according to a Dropout consistency score function m Uncertainty compared to the global modelDACS(u m ) (ii) a Specifically, calculating the unlabeled sample u by the expression 1) and the expression 2) m Uncertainty compared to the global modelDACS(u m ):
Figure 411891DEST_PATH_IMAGE001
1);
Figure 156993DEST_PATH_IMAGE002
2);
Wherein, the first and the second end of the pipe are connected with each other,Nas unlabeled sample u m The length of the sequence of (a) is,Jextracting the number of models of the base model for the optimal key elements,IandI’the model index of the two optimal key elements extraction base models M participating in the calculation,I=1,2,……JI’=1,2,……Jand is andII’
Figure 859370DEST_PATH_IMAGE003
extracting base model M for optimal key elements I * For character w k The prediction tag of (a) is determined,
Figure 689923DEST_PATH_IMAGE004
extracting base model M for optimal key elements I’ * For character w k Predictive label of, character w k As unlabeled sample u m Us of the text sequence m To (1)kA character of each position.
The larger the value of DACS(u_m), the larger the uncertainty of the unlabeled sample u_m with respect to the overall model, and the higher its labeling value.
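The exact formulas 1)–2) are rendered as images in the original; under the pairwise-disagreement reconstruction given above, a minimal sketch:

```python
# Sketch of expressions 1)-2): average, over characters, the fraction of
# base-model pairs whose predicted tags differ. preds[I][k] is model
# M_I*'s predicted tag for character w_k.
from itertools import combinations

def dacs(preds: list[list[str]]) -> float:
    J, N = len(preds), len(preds[0])
    pairs = list(combinations(range(J), 2))
    disagree = sum(preds[i][k] != preds[j][k]
                   for k in range(N) for i, j in pairs)
    return disagree / (N * len(pairs))
```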
Step 8.3: obtaining unlabeled sample u by a Sennce-BERT model m Sentence vector SenEmb m (ii) a Sentence vectors of all marked samples are obtained through a sequence-BERT model, and the Sentence vectors of all marked samples are clustered to obtain D clustering centers X d d=1,2, … … D; in this embodiment, the sentence vectors of all marked samples are clustered by the K-Means + + algorithm.
Step 8.4: calculating unlabeled sample u m Maximum semantic similarity to all cluster centersSimwt(u m ) (ii) a Specifically, calculating the unlabeled sample u by the expression 3) m Maximum semantic similarity to all cluster centersSimwt(u m ):
Figure 41007DEST_PATH_IMAGE005
3);
Wherein the content of the first and second substances,
Figure 375037DEST_PATH_IMAGE006
the function represents the Min-Max normalization function,CosineSimSenEmb m SenEmb d ) Representing a sentence vector SenEmb m And a clustering center X d Sentence vector SenEmb d Cosine similarity between them.
The larger the value of Simwt(u_m), the larger the probability that the unlabeled sample u_m belongs to a known class and the more representative the information it contains; conversely, the smaller the value, the larger the probability that the unlabeled sample u_m is an outlier and the less representative the information it contains.
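A sketch of steps 8.3–8.4 with scikit-learn's K-Means++; normalizing the best similarity across the current unlabeled subset is an assumption about where the Min-Max normalization is applied:

```python
# Sketch of steps 8.3-8.4: cluster labelled-sample sentence vectors with
# K-Means++ and score each unlabelled vector by its min-max-normalised
# best cosine similarity to any cluster centre.
import numpy as np
from sklearn.cluster import KMeans

def simwt_scores(unlabeled_vecs: np.ndarray, labeled_vecs: np.ndarray,
                 n_clusters: int) -> np.ndarray:
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10)
    km.fit(labeled_vecs)
    c = km.cluster_centers_
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    u = unlabeled_vecs / np.linalg.norm(unlabeled_vecs, axis=1, keepdims=True)
    best = (u @ c.T).max(axis=1)          # max cosine sim to any centre
    lo, hi = best.min(), best.max()       # min-max normalise over the subset
    return (best - lo) / (hi - lo) if hi > lo else np.ones_like(best)
```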
Step 8.5: binding unlabeled samples u m Calculating to obtain an unlabeled sample u m Information density ofInfo(u m ) (ii) a Specifically, calculating by expression 4) to obtain an unlabeled sample u m Information density ofInfo(u m ):
Figure 248315DEST_PATH_IMAGE007
4);
Wherein the content of the first and second substances,μand setting the regulation factors for preset regulation factors according to actual requirements.
If Info(u_m) ≥ θ, then for each character w_k, take the predicted tag that appears most often among the maximum probability label sequences in the weak label sequence set CL_m as the final label of the character w_k;

thereby obtaining the tag sequence ul_m corresponding to the text sequence us_m; the text sequence us_m and the tag sequence ul_m form a labeled value sample, and all labeled value samples form the labeled value sample set VS, which contains Z* labeled value samples, Z* ≤ Z.
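A sketch of step 8.5 under the combination assumed for expression 4); the weighting of the two indices and the helper name select_and_label are illustrative:

```python
# Sketch of step 8.5: combine uncertainty and representativeness into
# Info(u_m), keep samples above theta, and label each kept character by
# majority vote over the weak label sequences in CL_m.
from collections import Counter

def select_and_label(weak_labels: list[list[str]], dacs_score: float,
                     simwt_score: float, mu: float, theta: float):
    info = mu * dacs_score + (1 - mu) * simwt_score  # assumed combination
    if info < theta:
        return None                                   # not a value sample
    n = len(weak_labels[0])
    return [Counter(seq[k] for seq in weak_labels).most_common(1)[0][0]
            for k in range(n)]
```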
Many existing schemes reduce the labeling cost of training a key element extraction model through active learning, and common methods often screen value samples with a single uncertainty index while ignoring the distribution information among unlabeled samples. To exploit the advantages of different active learning sampling criteria on the key element extraction task, the Dropout consistency index (uncertainty) and the semantic similarity index (maximum semantic similarity) are fused, and value samples are selected with a combined query criterion. The criterion ensures that the selected samples are the most uncertain under the current overall model, while the semantic similarity accounts for the global information content of the samples, mitigating the negative influence of outlier samples on the overall model. Compared with conventional active learning sampling criteria, the combined query criterion characterizes the samples to be screened more comprehensively and helps select samples with richer information.
Furthermore, several optimal key element extraction base models predict the same labeled value sample to construct a weakly supervised learning scenario, and the predicted entities are inferred preferentially through entity-level multi-model voting, exploiting the advantages of different weakly supervised models; higher key element extraction precision can thus be obtained on top of several homogeneous base models, and the integrated model has better generalization and robustness than a single base model.
The method effectively integrates two strategies of active learning and weak supervised learning, saves the expensive cost of expert labeling data in the traditional active learning method, can screen out value samples with richer information content and accurately and efficiently label the value samples automatically, obviously reduces the manual labeling cost, and fully utilizes the unlabeled samples to improve the performance of extracting key elements.
Step nine: adding the marked value sample set VS obtained in the step eight into the training set Tr to form a new training set, so that the data volume in the training set can be increased, and removing the subset US in the step eight from the unmarked sample set U to form a new unmarked sample set; repeating the fourth step to the ninth step until the training stopping criterion is met;
step ten: and performing key element extraction on the test set Te through the optimal key element extraction base model.
Step 10.1: preprocessing the test set Te and obtaining a text sequence ts of each test sample q
Step 10.2: extracting base model M through each optimal key element after training 1 * 、M 2 * 、M 3 * Text sequence ts for each test sample separately q Predicting to obtain a text sequence ts q Corresponding maximum probability tag sequence tl q1 、tl q2 、tl q3 Forming the maximum probability label sequences of all test texts into a candidate label sequence set TLq
Step 10.3: definition a e For text sequences ts q ToeCharacters of individual positions in the candidate tag sequence set TLqFor each maximum probability tag sequence pair character a e The predicted label of (2) is the character a with the predicted label with the largest number of occurrences e The final label, resulting in the text sequence ts q Corresponding final tag sequence tl q And extracting results as key elements.
In this way, value samples with richer information can be screened out and labeled automatically, accurately and efficiently in low-resource scenarios, and a large number of high-quality labeled samples are generated without changing the syntactic structure and semantics of the original text as far as possible. This significantly reduces the cost of manual labeling, makes the fullest use of unlabeled samples, and effectively improves the performance of the key element extraction model, solving the problems of low quality of generated labeled samples, low element extraction precision, and the high cost of expert-labeled value samples in key element extraction methods based on conventional small-sample learning frameworks.
When monitoring public opinion on network platforms, the method can extract key public opinion information from long network texts promptly and accurately, can assist governments and enterprises in discovering potential risks and the focal points of network public opinion events, and promotes the modernization of public opinion monitoring systems and social governance capacity.
A key element extraction system of a long text is used for realizing the key element extraction method of the long text, and comprises a definition module 1, a preprocessing module 2, a model construction module 3, a data enhancement module 4, a model optimization module 5, a sample expansion module 6 and a model test module 7, as shown in FIG. 2;
the definition module 1: the system comprises a data acquisition module, a data processing module and a data processing module, wherein the data acquisition module is used for acquiring a network long text data set and dividing the network long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; the unlabeled sample set U is divided into a plurality of subsets.
The pretreatment module 2: the method is used for preprocessing the text data in the training set Tr, the verification set Va and the test set Te to obtain corresponding text sequences and label sequences.
The model construction module 3: the method is used for continuously pre-training the open source model and carrying out model migration to obtain a key element extraction base model M I
The data enhancement module 4: and the method is used for enhancing data of the training set Tr to obtain an enhanced set En.
The model optimization module 5: for extracting a base model M from each key element through an enhancement set En I Updating parameters; preprocessing a verification set Va to obtain a text sequence and a label sequence of each verification sample, and inputting the text sequence and the label sequence into each key element extraction base model M I In the method, the optimal key element extraction base model M with the highest F1 value in the verification set Va is obtained and stored I * Extracting key elements for the test set Te; until the current situation meets the training stop criterion.
The sample expansion module 6: if the current situation does not meet the training stopping criterion, inputting a subset US in the unlabeled sample set U into each optimal key element extraction base model M I * The method comprises the steps of predicting, screening unlabeled samples through improved value sample sampling and labeling criteria to obtain a labeled value sample set VS, adding the obtained labeled value sample set VS into a training set Tr to form a new training set, increasing the data volume in the training set, and removing a subset US from an unlabeled sample set U to form a new unlabeled sample set.
Model test module 7: inputting the text sequence of the test sample of the test set Te into the trained optimal key element extraction base model M I * And obtaining the maximum probability label sequence predicted by each model, and performing result integration on each maximum probability label sequence to obtain a key element extraction result.
A terminal device, comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the steps of the above-mentioned method for extracting key elements of long texts when executing the computer program.
Comparative example 1:
this comparative example differs from example 1 only in that the optimal DeBERTA-BiONLSTM-MHA-MCRF model (M) 1 * ) And extracting the key elements.
Comparative example 2:
this comparative example differs from example 1 only in that the optimal LEBERT-BiONLSTM-MHA-MCRF model (M) is used 2 * ) And extracting the key elements.
Comparative example 3:
this comparative example differs from example 1 only in that the optimal CogBERT-BiONLSTM-MHA-MCRF model (M) was used 3 * ) And extracting the key elements.
Comparative example 4:
this comparative example differs from example 1 in that the Ensemble model (fusion of M by the weakly supervised learning framework) was used 1 * 、M 2 * 、M 3 * The integration model) extracts the key elements.
Comparative example 5:
the present comparative example differs from example 1 in that key elements are extracted using the Ensemble (AL) model, AL (Active Learning) representing the use of an Active Learning strategy.
Comparative example 6:
the present comparative example differs from example 1 in that key elements are extracted using the Ensemble (DA) model, which represents the use of a Data enhancement strategy.
The experimental results of example 1 and the above comparative examples on the test set Te are shown in Table 1:
Table 1: (experimental results of example 1 and comparative examples 1–6; provided as an image in the original publication and not reproduced here)
As can be seen from the above table, compared with the other key element extraction models, the key element extraction model of the method achieves the best key element extraction result on the test set, a significant technical effect.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for extracting key elements of a long text is characterized by comprising the following steps:
the method comprises the following steps: acquiring a long text data set, and dividing the long text data set into a labeled sample set L and an unlabeled sample set U; dividing the marked sample set L into a training set Tr, a verification set Va and a test set Te; dividing an unlabeled sample set U into a plurality of subsets;
step two: preprocessing a training set Tr to obtain a text sequence and a label sequence of each training sample in the training set Tr;
step three: continuously pre-training the open source model to obtain a plurality of pre-training language models, and performing model migration through the pre-training language models to obtain a plurality of key element extraction base models;
step four: combining a text sequence and a label sequence of each training sample in a training set Tr, and performing semantic-based data enhancement on the training set Tr to obtain an enhancement set En;
step five: updating parameters of a plurality of key element extraction base models through an enhancement set En;
step six: preprocessing the verification set Va, acquiring a text sequence and a label sequence of each verification sample in the verification set Va, inputting the text sequence and the label sequence into each key element extraction base model respectively, and determining an optimal key element extraction base model corresponding to each key element extraction base model;
step seven: judging whether the training stopping criterion is met; if the training stopping criterion is met, executing a step ten, and if the training stopping criterion is not met, executing a step eight;
step eight: inputting a subset in the unmarked sample set U into each optimal key element extraction base model to obtain a corresponding marked value sample set;
step nine: adding the marked value sample set obtained in the step eight into a training set Tr, and removing the subset in the step eight from an unmarked sample set U; repeating the fourth step to the ninth step until the training stopping criterion is met;
step ten: and extracting key elements from the test set Te through the optimal key element extraction base model.
2. The method for extracting key elements of long texts according to claim 1, wherein the third step comprises:
step 3.1: continuously pre-training the source model through a public opinion corpus to obtain a corresponding pre-training language model, wherein the pre-training language model comprises: a DeBERTA model, a LEBERT model, a CogBERT model, a Syntax-BERT model, and a Sennce-BERT model;
step 3.2: carrying out model migration on part of the pre-training language models, wherein the obtained key element extraction base model comprises the following steps: deBERta-BiONLSTM-MHA-MCRF model, LEBERT-BiONLSTM-MHA-MCRF model, and CogBERT-BiONLSTM-MHA-MCRF model.
3. The method for extracting key elements of long texts according to claim 2, wherein the fourth step comprises:
step 4.1: defining the i-th training sample tr_i in the training set Tr, whose text sequence is s_i and whose tag sequence is l_i; obtaining the semantic vector of each character in the text sequence s_i through the Syntax-BERT model, and extracting the entity samples of the training sample tr_i according to the text sequence s_i and the tag sequence l_i to form an entity sample subset Ent_i; the j-th entity sample in the entity sample subset is ent_j = <StrEnt_j, Type_j>, wherein StrEnt_j is the string representation of the entity sample ent_j, and Type_j is the entity category of the entity sample ent_j;
step 4.2: averaging the semantic vectors of the characters in StrEnt_j to obtain the entity vector entEmb_j corresponding to the entity sample ent_j;
step 4.3: forming the entity samples in the entity sample subsets corresponding to all training samples into the entity sample set Ent of the training set, and calculating the cosine similarity Sim between the entity vectors of any two entity samples in the entity sample set Ent; when Sim ≥ σ and the entity categories of the two entity samples are the same, adding the two entity samples into each other's semantic neighbor sets, wherein σ is a preset entity similarity threshold;
step 4.4: traversing each training sample tr_i; if the semantic neighbor set of an entity sample ent_j is not an empty set, selecting an entity sample from the semantic neighbor set of ent_j to replace ent_j, obtaining an enhanced text sequence s_i* and its corresponding tag sequence l_i*, and thus obtaining the sentence pair sample to be evaluated pair_i = <s_i, s_i*>;
step 4.5: obtaining the sentence vector of each sentence in the sentence pair sample to be evaluated through the Sentence-BERT model, wherein the sentence vector of s_i is SenEmb_i and the sentence vector of s_i* is SenEmb_i*; calculating the cosine similarity SimSem between SenEmb_i and SenEmb_i*; when SimSem ≥ β, taking the enhanced text sequence s_i* and its corresponding tag sequence l_i* as an extended sample, and adding all extended samples into the training set Tr to obtain the enhancement set En, wherein β is a preset sentence similarity threshold.
4. The method for extracting key elements from a long text according to claim 3, wherein in the sixth step, the model parameters with the highest F1 value achieved by each key element extraction base model over the whole training process are used as the model parameters of the corresponding optimal key element extraction base model.
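A compact sketch of this checkpoint rule, keeping the parameters with the highest validation F1 seen so far; seqeval is an assumed dependency and the state dictionary is a placeholder:

```python
# Keep the best-F1 checkpoint over the whole training run. seqeval is an
# assumed dependency; f1_score computes entity-level F1 over BIO sequences.
from seqeval.metrics import f1_score

best = {"f1": -1.0, "state": None}

def maybe_update_best(model_state, y_true, y_pred):
    f1 = f1_score(y_true, y_pred)      # y_*: lists of BIO tag sequences
    if f1 > best["f1"]:
        best["f1"], best["state"] = f1, model_state
    return best["f1"]

maybe_update_best({"epoch": 1},
                  [["B-LOC", "I-LOC", "O"]], [["B-LOC", "I-LOC", "O"]])
print(best["f1"])   # 1.0 for this toy prediction
```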
5. The method for extracting key elements from long texts as claimed in claim 4, wherein in the seventh step, the training stopping criterion is that the performance of each optimal key element extraction base model on the verification set Va has reached a preset performance threshold α, or the training set Tr has reached a preset data volume.
6. The method for extracting key elements from long texts according to claim 4, wherein the eighth step comprises:
step 8.1: predicting the text sequence us_m of each unlabeled sample u_m in the subset US through each optimal key element extraction base model to obtain the weak label sequence set corresponding to the sample, wherein m = 1, 2, …, Z, and Z is the total number of unlabeled samples in the subset;
step 8.2: calculating the uncertainty DACS(u_m) of the unlabeled sample u_m relative to the global model according to a Dropout consistency score function;
step 8.3: obtaining the sentence vector SenEmb_m of the unlabeled sample u_m and the sentence vectors of the labeled samples respectively through the Sentence-BERT model, and clustering the sentence vectors of all labeled samples to obtain D cluster centers X_d, d = 1, 2, …, D;
step 8.4: calculating the maximum semantic similarity Simwt(u_m) of the unlabeled sample u_m to all cluster centers;
step 8.5: combining the uncertainty and the maximum semantic similarity of the unlabeled sample u_m to calculate its information density Info(u_m); if Info(u_m) ≥ θ, taking the predicted label of the character w_k in each maximum-probability label sequence of the weak label sequence set corresponding to the sample as the final label of the character w_k, wherein the character w_k is the character at the k-th position of the text sequence us_m of the unlabeled sample u_m;
thereby obtaining the tag sequence ul_m corresponding to the text sequence us_m; the text sequence us_m and the tag sequence ul_m form a labeled value sample, and all labeled value samples form the labeled value sample set VS.
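Step 8.3 could be realized along the lines below, with sentence-transformers and scikit-learn as assumed dependencies; the encoder checkpoint is a placeholder, not the patent's specific Sentence-BERT model:

```python
# Step 8.3 sketch: cluster labeled-sample sentence vectors into D centers.
# sentence-transformers and scikit-learn are assumed dependencies; the
# checkpoint name is a placeholder.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
labeled_texts = ["长沙暴雨致橘子洲头积水", "市区多路段交通中断"]

emb = encoder.encode(labeled_texts)          # sentence vectors SenEmb
D = 2
kmeans = KMeans(n_clusters=D, n_init=10, random_state=0).fit(emb)
centers = kmeans.cluster_centers_            # the D cluster centers X_d
```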
7. The method for extracting key elements from long texts according to claim 6, wherein in step 8.2, the uncertainty DACS(u_m) of the unlabeled sample u_m relative to the global model is calculated through expression 1) and expression 2):

$$\mathrm{Agree}(M_I, M_{I'}) = \frac{1}{N} \sum_{k=1}^{N} \mathbb{1}\big[M_I^{*}(w_k) = M_{I'}^{*}(w_k)\big] \qquad 1)$$

$$DACS(u_m) = 1 - \frac{2}{J(J-1)} \sum_{I=1}^{J-1} \sum_{I'=I+1}^{J} \mathrm{Agree}(M_I, M_{I'}) \qquad 2)$$

wherein N is the sequence length of the unlabeled sample u_m; J is the number of optimal key element extraction base models; I and I' are the indexes of the two optimal key element extraction base models M participating in the calculation, with I = 1, 2, …, J, I' = 1, 2, …, J, and I ≠ I'; M_I^{*}(w_k) is the predicted label of the optimal key element extraction base model M_I for the character w_k, and M_{I'}^{*}(w_k) is the predicted label of the optimal key element extraction base model M_{I'} for the character w_k;
in step 8.4, the maximum semantic similarity Simwt(u_m) of the unlabeled sample u_m to all cluster centers is calculated through expression 3):

$$Simwt(u_m) = \max_{1 \le d \le D} \mathrm{MinMax}\big(\mathrm{CosineSim}(SenEmb_m, SenEmb_d)\big) \qquad 3)$$

wherein MinMax(·) denotes the Min-Max normalization function, and CosineSim(SenEmb_m, SenEmb_d) denotes the cosine similarity between the sentence vector SenEmb_m and the sentence vector SenEmb_d of the cluster center X_d;
in step 8.5, the information density Info(u_m) of the unlabeled sample u_m is calculated through expression 4):

$$Info(u_m) = \mu \cdot DACS(u_m) + (1 - \mu) \cdot Simwt(u_m) \qquad 4)$$

wherein μ is a preset regulating factor.
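The scoring of expressions 1) through 4) can be sketched as below; because the granted publication renders these formulas as images, the forms here follow the reconstruction above and should be read as one plausible interpretation rather than the authoritative equations:

```python
# Illustrative scoring of expressions 1)-4). The formulas follow the
# reconstruction above, which is an assumption about the image-rendered
# equations in the granted publication.
import numpy as np
from itertools import combinations

def dacs(pred_label_seqs):
    """Expressions 1)-2): pairwise disagreement of the J base models'
    predicted label sequences for one sample (higher = more uncertain)."""
    n = len(pred_label_seqs[0])
    agree = [sum(a[k] == b[k] for k in range(n)) / n
             for a, b in combinations(pred_label_seqs, 2)]
    return 1.0 - sum(agree) / len(agree)

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def simwt(sen_embs, centers):
    """Expression 3): each sample's max cosine similarity to the cluster
    centers, Min-Max normalized across the batch (normalization scope is
    an assumption)."""
    raw = [max(float(e @ c / (np.linalg.norm(e) * np.linalg.norm(c)))
               for c in centers) for e in sen_embs]
    return min_max(raw)

def info(dacs_val, simwt_val, mu=0.5):
    """Expression 4): weighted combination with regulating factor mu."""
    return mu * dacs_val + (1 - mu) * simwt_val

preds = [["B-LOC", "I-LOC", "O"], ["B-LOC", "O", "O"], ["B-LOC", "I-LOC", "O"]]
embs = np.array([[1.0, 0.2, 0.1], [0.1, 0.9, 0.3]])
print(info(dacs(preds), simwt(embs, np.eye(3))[0]))
```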
8. The method for extracting key elements from long texts according to claim 7, wherein step ten comprises:
step 10.1: preprocessing the test set Te and obtaining the text sequence ts_q of each test sample;
step 10.2: predicting the text sequence ts_q of each test sample through each trained optimal key element extraction base model to obtain the maximum-probability label sequence corresponding to the text sequence ts_q, and forming the maximum-probability label sequences of all test texts into a candidate label sequence set;
step 10.3: defining a_e as the character at the e-th position of the text sequence ts_q; taking the predicted label with the largest number of occurrences, among the predicted labels of the character a_e in each maximum-probability label sequence of the candidate label sequence set, as the final label of the character a_e, thereby obtaining the final tag sequence tl_q corresponding to the text sequence ts_q as the key element extraction result.
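Step 10.3 is a per-character majority vote over the candidate label sequences; a minimal sketch, assuming the J base models return equal-length label sequences:

```python
# Per-character majority vote over the candidate label sequences produced
# by the trained base models for one test text.
from collections import Counter

def vote_final_labels(candidate_seqs):
    """candidate_seqs: J label sequences of equal length for one test text."""
    length = len(candidate_seqs[0])
    return [Counter(seq[e] for seq in candidate_seqs).most_common(1)[0][0]
            for e in range(length)]

cands = [["B-LOC", "I-LOC", "O"],
         ["B-LOC", "O",     "O"],
         ["B-LOC", "I-LOC", "O"]]
print(vote_final_labels(cands))   # ['B-LOC', 'I-LOC', 'O']
```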
9. A long text key element extraction system for implementing the long text key element extraction method of any one of claims 1~8, comprising a definition module (1), a preprocessing module (2), a model construction module (3), a data enhancement module (4), a model optimization module (5), a sample expansion module (6) and a model test module (7);
definition module (1): used for obtaining the long text data set;
preprocessing module (2): used for preprocessing the text data to obtain the text sequence and the label sequence;
model construction module (3): used for performing continuous pre-training and model migration on the open source model to obtain the key element extraction base models;
data enhancement module (4): used for performing data enhancement on the training set Tr to obtain the enhancement set En;
model optimization module (5): used for updating the parameters of the key element extraction base models through the enhancement set;
sample expansion module (6): used for generating the labeled value sample set and adding it into the training set Tr;
model test module (7): used for extracting key elements from the test set Te through the optimal key element extraction base model.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for extracting key elements of a long text according to any one of claims 1~8 when executing the computer program.
CN202211592205.4A 2022-12-13 2022-12-13 Method, system and terminal device for extracting key elements of long text Active CN115600602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592205.4A CN115600602B (en) 2022-12-13 2022-12-13 Method, system and terminal device for extracting key elements of long text

Publications (2)

Publication Number Publication Date
CN115600602A CN115600602A (en) 2023-01-13
CN115600602B true CN115600602B (en) 2023-02-28

Family

ID=84851846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592205.4A Active CN115600602B (en) 2022-12-13 2022-12-13 Method, system and terminal device for extracting key elements of long text


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245197B (en) * 2023-02-21 2023-11-07 北京数美时代科技有限公司 Method, system, medium and equipment for improving training rate of language model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506696A (en) * 2020-03-03 2020-08-07 平安科技(深圳)有限公司 Information extraction method and device based on small number of training samples
CN111639163A (en) * 2020-04-29 2020-09-08 深圳壹账通智能科技有限公司 Problem generation model training method, problem generation method and related equipment
US20220019742A1 (en) * 2020-07-20 2022-01-20 International Business Machines Corporation Situational awareness by fusing multi-modal data with semantic model
CN114648016A (en) * 2022-03-29 2022-06-21 河海大学 Event argument extraction method based on event element interaction and tag semantic enhancement
CN114741473B (en) * 2022-04-17 2023-04-18 中国人民解放军国防科技大学 Event extraction method based on multi-task learning
CN115510180A (en) * 2022-09-30 2022-12-23 中国电子科技集团公司第十研究所 Multi-field-oriented complex event element extraction method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant