CN117494760A - Semantic tag-rich data augmentation method based on ultra-large-scale language model


Info

Publication number
CN117494760A
CN117494760A
Authority
CN
China
Prior art keywords
node
subject
samples
classification
data
Prior art date
Legal status
Pending
Application number
CN202311320484.3A
Other languages
Chinese (zh)
Inventor
肖濛 (Xiao Meng)
周园春 (Zhou Yuanchun)
蔡勋鑫 (Cai Xunxin)
宁致远 (Ning Zhiyuan)
Current Assignee
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date
Filing date
Publication date
Application filed by Computer Network Information Center of CAS
Priority to CN202311320484.3A
Publication of CN117494760A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic-tag-rich data augmentation method based on a very large scale language model, comprising the following steps: 1) obtaining subject text data from a plurality of data sets to form a subject database; traversing each data sample in the subject database, dividing data samples with the same classification number into the same node of a hierarchical subject structure sampling tree, determining the position of each node according to its classification number, constructing the tree, and calculating the statistical information of each node; 2) determining the number of enhanced samples for each subject classification; 3) updating the hierarchical subject structure sampling tree according to the number of enhanced samples of each subject classification, and recalculating the statistical information of each node; 4) judging, from the statistical information of each node before and after the update, whether the number of samples corresponding to each subject classification is balanced, and repeating steps 2)-3) if not; 5) generating, for each subject classification, a corresponding number of data samples of that classification using the very large scale language model.

Description

Semantic tag-rich data augmentation method based on ultra-large-scale language model
Technical Field
The invention relates to the fields of big data, very large scale language models, data augmentation, multi-category text classification and hierarchical multi-label classification, and in particular to a data augmentation method, based on a very large scale language model (Large Language Model, LLM), for data sets rich in semantic labels, intended to alleviate problems such as imbalance between sample categories.
Background
Inferring the subject of a given research proposal is a preliminary step in automated peer review systems, in which accurate subject codes help funding administrators assign domain-relevant experts for fair evaluation. Because disciplines are inherently hierarchical, such topic inference tasks can be defined as hierarchical multi-label classification tasks. However, the numbers of applications associated with these hierarchical subject labels are unbalanced, owing to the development, planning and division of the primary disciplines (e.g., information science and mathematical science). This imbalance at the data level can further degrade the accuracy of an automated topic inference model on some secondary disciplines. Moreover, it may lead to some new disciplines being reviewed by experts unrelated to the field, further restricting the development of emerging disciplines. Hierarchical discipline labels, that is, labels rich in semantic information once the discipline hierarchy is fixed, not only carry the information of the current discipline field, but also reveal rich information such as the parent disciplines to which they belong according to the hierarchical discipline system.
With the rapid development of machine learning, augmenting unbalanced data to relieve model overfitting has become an important research field. However, the specific nature of text data and its semantic complexity make it challenging to apply existing methods directly to natural language processing tasks. The emergence of very large scale language models offers a way to address this challenge, and the hierarchical and semantic characteristics of each label in a discipline system make data augmentation with a large language model feasible. A very large scale language model is a deep learning model trained on a large amount of text data to understand and generate natural language; by learning the statistical regularities and semantic structure of the language, it can encode, generate and understand text, and it is usually built from a deep neural network comprising many layers. By training on large-scale text data, these models learn the underlying patterns, semantic relationships and grammatical rules of the language; they can automatically extract features from input text and convert them into high-dimensional vector representations for subsequent analysis and processing. The present method uses a pre-trained language model, combined with deep learning and natural language processing techniques, to analyse and understand large-scale text data in depth, and can accordingly generate richer and more accurate enhancement data from the rich semantic information in the labels.
Traditional augmentation methods for semantically rich label data often rely on manual labeling and hand-written rules; because of the richness of the semantic information in the labels, their cost is high and their effect limited. A method based on a very large scale language model can instead automatically learn rich semantic information from the provided labels and apply it to the enhanced text data.
Disclosure of Invention
The aim of the invention is to provide a semantic-tag-rich data augmentation method based on a very large scale language model. Its core idea is to encode the text and learn its representation with a pre-trained language model. Such language models are usually pre-trained on extensive unsupervised training data, enabling them to learn rich semantic information and language structure. When generating semantically rich label data, a pre-trained language model can encode the input text to obtain its semantic representation; these semantic representations can then be used for tasks such as label prediction, entity recognition and relationship extraction to generate semantically rich label data.
The invention performs quantity and frequency sampling analysis on an existing semantically rich label text database, selects the few-sample categories whose frequency or quantity distribution is unbalanced, constructs prompts from those categories and randomly chosen expert-annotated keywords, and passes the prompts to a large language model. The generated text data is labeled as enhancement data and supplied, together with the original data, to the classification model, reducing the imbalance between category samples and thereby improving classification accuracy.
The invention specifically comprises the following steps:
a semantic tag-rich data augmentation method based on a very large scale language model comprises the following steps:
1) Obtaining subject text data from a plurality of data sets to form a subject database; traversing each data sample in the subject database, dividing data samples with the same classification number into the same node of a hierarchical subject structure sampling tree, determining the position of each node in the tree according to its classification number, and thereby constructing the hierarchical subject structure sampling tree, wherein each node corresponds to one sample set and one classification number; counting the global frequency globalFreq and the hierarchical frequency levelFreq of each node in the tree, the father frequency fatherFreq of each father node, and the leaf node frequency leafFreq of each leaf node; wherein the global frequency globalFreq of a node is the ratio of the number of samples in the sample set corresponding to that node to the total number of samples; the hierarchical frequency levelFreq of a node is the ratio of the number of samples in the sample set corresponding to that node to the number of samples corresponding to all nodes on the same layer as that node; the father frequency fatherFreq of a father node is the ratio of the sum of the numbers of samples in the sample sets corresponding to all nodes under that father node to the total number of samples; and the leaf node frequency leafFreq of a leaf node is the ratio of the number of samples in the sample set corresponding to that leaf node to the total number of samples;
2) According to s_c = N · exp(-n_c^(1/λ)) / Σ_{i∈C} exp(-n_i^(1/λ)), determining the number of enhanced samples for each subject classification; wherein N is the total number of enhanced samples, λ is a hyperparameter, C is the subject classification set corresponding to the subject database, and c is one subject classification in the set C, i.e., c ∈ C; the number of samples of the node corresponding to subject classification c is n_c, the number of samples of the node corresponding to subject classification i is n_i, and the number of enhanced samples corresponding to subject classification c is s_c;
3) Updating the hierarchical subject structure sampling tree according to the number of the enhanced samples of each subject classification, and calculating the statistical information of each node in the hierarchical subject structure sampling tree after updating;
4) Judging, according to the statistical information of each node before and after the update, whether the number of samples corresponding to each subject classification is balanced; if not, adjusting the value of the hyperparameter λ and repeating steps 2)-3); if balanced, proceeding to step 5);
5) Generating an enhanced-sample-number list times according to the number of enhanced samples of each subject classification, and then generating, for each subject classification, a corresponding number of data samples of that classification using the pre-trained very large scale language model.
Further, the method for generating, for each subject classification, a corresponding number of data samples of that classification using the pre-trained very large scale language model comprises: for subject classification c, invoking the pre-trained very large scale language model s_c times for data enhancement, each time inputting the prompt corresponding to subject classification c as the input of the pre-trained very large scale language model and taking the output of the pre-trained very large scale language model as a data enhancement result.
Further, for subject classification c, one keyword is randomly selected from the keywords belonging to subject classification c in an expert-annotated keyword database and used as a part of the prompt, so as to construct the prompt corresponding to subject classification c.
Further, the prompts are constructed from a prompt template comprising background knowledge, generation principles, generation format, language style, discipline and keyword, wherein only the discipline and keyword vary with the subject classification.
Further, the very large scale language model is a BERT model, a GPT model, or LLaMA and its instruction-fine-tuned variant Vicuna.
A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the above method.
A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor realizes the steps of the above method.
Compared with previous related methods, the semantic-tag-rich data augmentation method based on a very large scale language model has the following advantages and contributions:
(1) The method can automatically learn rich semantic information from large-scale text data, and avoids the complexity and high cost of manual labeling and rule design in the traditional method.
(2) The method based on the ultra-large scale language model can utilize the advantages of deep learning and natural language processing technology to deeply analyze and understand the text and generate more accurate and rich semantic tag data.
(3) Because the language model used by the invention is pre-trained on large-scale data, the generated semantically rich label data also has better coverage and representativeness.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples.
The invention aims to use rich semantic labels to enhance the data of unbalanced disciplines through a pre-trained large language model, reducing the imbalance between hierarchical disciplines so as to improve subject inference; it mainly uses the paper subject classifications, paper abstracts, paper keywords and expert-annotated keywords in the original hierarchical subject text database. The flow of the method is shown in FIG. 1.
Step one: regarding an original hierarchical subject database formed by organizing one or more data sets, focusing on important information such as subject classification, abstract text, keywords and the like, sampling the number and the frequency of each subject according to a predetermined hierarchical subject classification structure, and constructing a hierarchical subject structure sampling tree. The method comprises the steps of dividing samples with the same classification number into the same node, wherein each node corresponds to one sample set and one classification number; and determining the position of each node in the classification tree according to the classification number.
Each subject classification is taken as a node of the hierarchical subject structure sampling tree. The data samples are first traversed and counted: each data sample is read in turn from the original hierarchical subject database. For example, for a sample whose hierarchical classification is A => A01 => A0102 => A010203, A is the root node of the discipline and every remaining node is a child of the previous node. Sampling proceeds downward from the root node along the hierarchical classification, and the num attribute of every node on the path is incremented by 1; that is, the num values of nodes A, A01, A0102 and A010203 are each increased by 1 in turn. This loop continues until all samples in the original database have been counted.
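A minimal sketch of this construction in Python, where the Node class and the attribute names num and children are illustrative assumptions (the filing publishes no code):

```python
# Build the hierarchical discipline structure sampling tree and count samples.
class Node:
    def __init__(self, code):
        self.code = code       # classification number, e.g. "A0102"
        self.num = 0           # number of samples whose path passes this node
        self.children = {}     # child classification number -> Node

def insert_sample(root, path):
    """Walk one hierarchical label path such as ["A", "A01", "A0102",
    "A010203"], creating nodes as needed and incrementing the num
    attribute of every node on the path by 1."""
    node = root
    for code in path:
        node = node.children.setdefault(code, Node(code))
        node.num += 1

root = Node("ROOT")
samples = [["A", "A01", "A0102", "A010203"],   # toy paths, not patent data
           ["A", "A01", "A0103"]]
for path in samples:
    insert_sample(root, path)
total_samples = len(samples)
```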
The frequency calculation is evaluated from multiple angles:
1. The global frequency globalFreq is the ratio of the number of samples in the sample set corresponding to each node to the total number of samples; it reflects a node's share of the whole hierarchical subject structure sampling tree.
2. The hierarchical frequency levelFreq is the ratio of the number of samples in the sample set corresponding to each node to the number of samples corresponding to all nodes on the same layer; it reflects a node's share of its layer.
3. The father frequency fatherFreq is the ratio of the sum of the numbers of samples in the sample sets corresponding to all nodes under the same father node to the total number of samples; it reflects the proportion among the nodes under one father node, i.e., whether the sub-discipline classifications under a certain discipline are balanced.
4. The leaf node frequency leafFreq is the ratio of the number of samples in the sample set corresponding to each leaf node to the total number of samples; it is used to analyse the enhancement effect on subject classifications with particularly few samples. Only leaf nodes have a valid leafFreq; for all non-leaf nodes it is set to -1 and not used in the analysis.
Once counting is complete, frequency calculation is performed on the hierarchical subject structure sampling tree, traversing layer by layer in level order and computing each frequency on the nodes of each layer. The calculation is differentiated by layer, and the degree of imbalance of each discipline tree node is evaluated on the basis of the count and frequency sampling.
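The four statistics can be computed in one level-order pass. This sketch reuses the Node tree from the sketch above; the attribute names are again assumptions chosen to mirror the text:

```python
from collections import defaultdict

def compute_frequencies(root, total):
    """Annotate every node with globalFreq, levelFreq, fatherFreq and
    leafFreq as defined above; leafFreq (and fatherFreq, for leaves)
    is -1 where it does not apply."""
    levels = defaultdict(list)            # depth -> nodes on that layer

    def collect(node, depth):
        levels[depth].append(node)
        for child in node.children.values():
            collect(child, depth + 1)

    for child in root.children.values():
        collect(child, 1)

    for nodes in levels.values():
        layer_total = sum(n.num for n in nodes)
        for n in nodes:
            n.globalFreq = n.num / total
            n.levelFreq = n.num / layer_total
            n.leafFreq = n.num / total if not n.children else -1
            # share of all samples under this father node in the corpus
            n.fatherFreq = (sum(c.num for c in n.children.values()) / total
                            if n.children else -1)

compute_frequencies(root, total_samples)
```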
Step two: and (3) determining and calculating the quantity of data enhancement required by each node through LLM according to the hierarchical subject structure sampling tree generated in the step one and the related sampling results thereof. The total number of samples to be enhanced is first determined, and according to the size of the original database, for example, the original database contains 10000 pieces of data, the total number of samples to be enhanced can be selectively generated according to the proportion of 0.5. In selecting a decision to generate a sample number decision, consider utilizationThe function better distributes the number of samples between discipline categories in terms of number and frequency. But compared to the frequency freq as z in the above formula i The resulting list of times, 1/λ with the sample number results will make the discipline classifications smoother, while the purpose of adding-1 is to make the discipline classifications with smaller sample numbers get correspondingly larger enhancement numbers, therefore->
For the total number of enhanced samples N and a discipline classification c ∈ C among all discipline classifications C with a sampled count of n_c, its number of data-augmented samples is s_c = N · exp(-n_c^(1/λ)) / Σ_{i∈C} exp(-n_i^(1/λ)). The values s_c generated for each node in turn form an enhanced-sample-number list times, ordered by the hierarchical subject nodes.
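A sketch of this allocation; note that the exponent form z_c = -n_c^(1/λ) is reconstructed from the surrounding description rather than quoted from the original formula image:

```python
import math

def allocate_enhanced_samples(counts, N, lam):
    """Distribute the enhancement budget N over discipline classes with a
    softmax over z_c = -n_c**(1/lam): a larger lam smooths the allocation,
    and the minus sign gives rarer classes more enhanced samples."""
    z = {c: math.exp(-(n ** (1.0 / lam))) for c, n in counts.items()}
    norm = sum(z.values())
    return {c: round(N * zc / norm) for c, zc in z.items()}

# Toy example: the rarer class receives most of the budget.
times = allocate_enhanced_samples({"A010203": 12, "A0103": 340},
                                  N=500, lam=3.0)
```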
The hyperparameter λ in step two can be adjusted according to the actual effect. After each adjustment, whether the number of samples corresponding to each subject classification is balanced is judged from the updated frequency statistics of the nodes; if not, the value of λ is adjusted and the number of enhanced samples for each subject classification is reassigned; if balanced, proceed to step three.
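The adjustment loop might look as follows, reusing allocate_enhanced_samples from the sketch above; the balance criterion (ratio of largest to smallest post-update class size) and the direction of the λ update are assumptions, since the text leaves both to the practitioner:

```python
def rebalance(counts, N, lam=1.0, tol=0.1, max_iter=20):
    """Allocate, simulate the updated class sizes, and adjust lambda
    until they are roughly balanced (the steps 2)-4) loop)."""
    for _ in range(max_iter):
        s = allocate_enhanced_samples(counts, N, lam)
        updated = {c: counts[c] + s[c] for c in counts}
        if max(updated.values()) <= (1 + tol) * min(updated.values()):
            break                 # balanced: proceed to step three
        lam *= 1.5                # heuristic: smooth the next allocation
    return s, lam
```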
Step three: according to the list time of the enhanced sample number generated in the second step, invoking a pre-trained ultra-large scale language model to perform time on a subject with a subject classification order of c in all subject classifications (c) The data enhancement is carried out, the prompting words are input each time and used as the input of the pre-trained ultra-large-scale language model, and the output of the prompting words is obtained and used as the data enhancement result. In order to avoid similar or even identical results obtained when the ultra-large scale language model (Large Language Model, LLM) data are continuously used for classifying the same subject, when the input prompt word template of the ultra-large scale language model is constructed in an experiment, an expert annotation keyword database is called to randomly select one keyword from keywords belonging to the subject as a part of the template to be constructed. Therefore, when the ultra-large scale language model is called for data enhancement, the template consists of background knowledge, a generation principle, a generation format, a language style, discipline and keywords, wherein only discipline and keywords are changed according to different discipline classifications, and other components of the prompting word template are basically determined for the enhancement task.
The results obtained by calling the very large scale language model for data enhancement are then annotated according to their subject classification, completing the labeling of the enhanced data for subsequent training of the downstream classifier based on pre-trained BERT.
The invoked pre-trained models include the BERT (Bidirectional Encoder Representations from Transformers) model, the GPT (Generative Pre-trained Transformer) model, LLaMA and its instruction-fine-tuned variant Vicuna, etc.
The selection of the pre-training model in the third step can be adjusted according to the actual effect.
Experimental verification: a pre-trained BERT model followed by several fully connected layers forms a neural network for the downstream hierarchical subject inference task. The hierarchical subject enhancement samples generated in step two are combined with the original subject database of step one; the abstract text, after passing through the pre-trained tokenizer, is used as input, and the pooler_output, which is better suited to sentence-level tasks, is fed into the downstream neural network for training, so that reducing the imbalance between disciplines improves the subject inference accuracy on subject text data. In practical experiments, data augmentation was performed on a data set of size 14028; with N set to 5% of the total, the classification model achieved a 4% improvement in F1 score, and with N set to 10%, a 10% improvement in F1 score was obtained. This fully demonstrates the effectiveness of the method in the unbalanced data augmentation scenario of a semantic-tag-rich model for a discipline system.
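A sketch of the downstream classifier, assuming the Hugging Face transformers API; the checkpoint name, hidden size and label count are illustrative choices not specified by the text:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DisciplineClassifier(nn.Module):
    """BERT followed by fully connected layers; pooler_output serves as
    the sentence-level representation of the abstract text."""
    def __init__(self, num_labels, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_labels))

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
batch = tokenizer(["an example abstract text"], return_tensors="pt",
                  padding=True, truncation=True)
model = DisciplineClassifier(num_labels=91)   # label count is a placeholder
logits = model(batch["input_ids"], batch["attention_mask"])
```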
The construction of the neural network in the experiment can be adjusted according to the actual effect, so as to better demonstrate how data enhancement weakens the degree of imbalance and thus improves subject inference capability.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be subject to the claims.

Claims (7)

1. A semantic-tag-rich data augmentation method based on a very large scale language model, comprising the following steps:
1) Obtaining subject text data from a plurality of data sets to form a subject database; traversing each data sample in the subject database, dividing data samples with the same classification number into the same node of a hierarchical subject structure sampling tree, determining the position of each node in the tree according to its classification number, and thereby constructing the hierarchical subject structure sampling tree, wherein each node corresponds to one sample set and one classification number; counting the global frequency globalFreq and the hierarchical frequency levelFreq of each node in the tree, the father frequency fatherFreq of each father node, and the leaf node frequency leafFreq of each leaf node; wherein the global frequency globalFreq of a node is the ratio of the number of samples in the sample set corresponding to that node to the total number of samples; the hierarchical frequency levelFreq of a node is the ratio of the number of samples in the sample set corresponding to that node to the number of samples corresponding to all nodes on the same layer as that node; the father frequency fatherFreq of a father node is the ratio of the sum of the numbers of samples in the sample sets corresponding to all nodes under that father node to the total number of samples; and the leaf node frequency leafFreq of a leaf node is the ratio of the number of samples in the sample set corresponding to that leaf node to the total number of samples;
2) According to s_c = N · exp(-n_c^(1/λ)) / Σ_{i∈C} exp(-n_i^(1/λ)), determining the number of enhanced samples for each subject classification; wherein N is the total number of enhanced samples, λ is a hyperparameter, C is the subject classification set corresponding to the subject database, and c is one subject classification in the set C, i.e., c ∈ C; the number of samples of the node corresponding to subject classification c is n_c, the number of samples of the node corresponding to subject classification i is n_i, and the number of enhanced samples corresponding to subject classification c is s_c;
3) Updating the hierarchical subject structure sampling tree according to the number of the enhanced samples of each subject classification, and calculating the statistical information of each node in the hierarchical subject structure sampling tree after updating;
4) Judging, according to the statistical information of each node before and after the update, whether the number of samples corresponding to each subject classification is balanced; if not, adjusting the value of the hyperparameter λ and repeating steps 2)-3); if balanced, proceeding to step 5);
5) Generating an enhanced-sample-number list times according to the number of enhanced samples of each subject classification, and then generating, for each subject classification, a corresponding number of data samples of that classification using the pre-trained very large scale language model.
2. The method of claim 1, wherein generating, for each subject classification, a corresponding number of data samples of that classification using the pre-trained very large scale language model comprises: for subject classification c, invoking the pre-trained very large scale language model s_c times for data enhancement, each time inputting the prompt corresponding to subject classification c as the input of the pre-trained very large scale language model and taking the output of the pre-trained very large scale language model as a data enhancement result.
3. The method according to claim 2, wherein, for subject classification c, one keyword is randomly selected from the keywords belonging to subject classification c in an expert-annotated keyword database and used as a part of the prompt, thereby constructing the prompt corresponding to subject classification c.
4. The method according to claim 3, wherein the prompts are constructed from a prompt template comprising background knowledge, generation principles, generation format, language style, discipline and keyword, wherein only the discipline and keyword vary with the subject classification.
5. The method according to claim 1, 2 or 3, wherein the very large scale language model is a BERT model, a GPT model, or LLaMA and its instruction-fine-tuned variant Vicuna.
6. A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1 to 5.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN202311320484.3A 2023-10-12 2023-10-12 Semantic tag-rich data augmentation method based on ultra-large-scale language model Pending CN117494760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311320484.3A CN117494760A (en) 2023-10-12 2023-10-12 Semantic tag-rich data augmentation method based on ultra-large-scale language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311320484.3A CN117494760A (en) 2023-10-12 2023-10-12 Semantic tag-rich data augmentation method based on ultra-large-scale language model

Publications (1)

Publication Number Publication Date
CN117494760A 2024-02-02

Family

ID=89683804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311320484.3A Pending CN117494760A (en) 2023-10-12 2023-10-12 Semantic tag-rich data augmentation method based on ultra-large-scale language model

Country Status (1)

Country Link
CN (1) CN117494760A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786414A (en) * 2024-02-23 2024-03-29 云南联合视觉科技有限公司 Method for constructing medical instruction data set
CN117786414B (en) * 2024-02-23 2024-05-10 云南联合视觉科技有限公司 Method for constructing medical instruction data set

Similar Documents

Publication Publication Date Title
CN110597735B (en) Software defect prediction method for open-source software defect feature deep learning
US7685082B1 (en) System and method for identifying, prioritizing and encapsulating errors in accounting data
CN111026842A (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN109857846B (en) Method and device for matching user question and knowledge point
CN113535963B (en) Long text event extraction method and device, computer equipment and storage medium
CN113268561B (en) Problem generation method based on multi-task joint training
KR102109369B1 (en) Artificial Intelligence System to Predict Changes and Explain Reasons in Time Series
CN109766911A (en) A kind of behavior prediction method
CN113312480A (en) Scientific and technological thesis level multi-label classification method and device based on graph convolution network
CN113761893A (en) Relation extraction method based on mode pre-training
CN116796045B (en) Multi-dimensional book grading method, system and readable medium
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN117494760A (en) Semantic tag-rich data augmentation method based on ultra-large-scale language model
CN113779264A (en) Trade recommendation method based on patent supply and demand knowledge graph
CN115270797A (en) Text entity extraction method and system based on self-training semi-supervised learning
Song et al. Rgvisnet: A hybrid retrieval-generation neural framework towards automatic data visualization generation
CN113886562A (en) AI resume screening method, system, equipment and storage medium
CN110019796A (en) A kind of user version information analysis method and device
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN116258504B (en) Bank customer relationship management system and method thereof
CN117010373A (en) Recommendation method for category and group to which asset management data of power equipment belong
Sekiyama et al. Automated proof synthesis for the minimal propositional logic with deep neural networks
CN114282875A (en) Flow approval certainty rule and semantic self-learning combined judgment method and device
Hou et al. FewJoint: few-shot learning for joint dialogue understanding
CN113570455A (en) Stock recommendation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination