CN115829036B - Sample selection method and device for text knowledge reasoning model continuous learning - Google Patents
- Publication number: CN115829036B
- Application number: CN202310107542.8A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D: climate change mitigation technologies in information and communication technologies)
Abstract
A sample selection method and device for continuous learning of a text knowledge reasoning model, belonging to the technical field of natural language reasoning, comprising historical task sample selection and current task sample selection. The historical task sample selection comprises: determining the number of samples selected to be added to the memory set; and selecting samples, namely selecting the memory set by measuring samples against representativeness, difference, and balance indexes, traversing the samples with one of two schemes. The current task sample selection comprises: sample representativeness analysis, sample difficulty analysis, and sample sampling. Compared with the prior art, which selects representative samples based on cluster centers, the method adapts better to complex text reasoning scenarios and effectively uses a small number of samples to approximate the distribution of the original samples, enabling the model to memorize the knowledge learned on historical tasks.
Description
Technical Field
The invention relates to a sample selection method and device for continuous learning of a text knowledge reasoning model, and belongs to the technical field of natural language reasoning.
Background
The natural language reasoning task is: given a premise text and a hypothesis text, judge whether the hypothesis text is correct, incorrect, or independent, taking the premise text as the standard. Text knowledge reasoning is a special form of the natural language reasoning task, in which the premise text is a knowledge point in a professional field, or a description of facts related to such a knowledge point, and the hypothesis text describes how different people understand that knowledge point. For example, in an economics examination, the premise text is professional knowledge or a related factual description in the field of economic law; the reference answer of a test question might read: "According to the rules of the company legal system, when the natural-person shareholders of a limited liability company change due to inheritance and other shareholders claim the right of first refusal, the people's court does not support the claim, unless the articles of association provide otherwise or all shareholders have agreed otherwise." The hypothesis text is the understanding of the knowledge point by different people, corresponding here to an examinee's answer, such as: "Qian requests to exercise the right of first refusal, and the people's court does not support it. The equity of a shareholder of the limited liability company is inherited by the shareholder's heir." In this example, the text knowledge reasoning task is to judge, according to the premise text (the reference answer), whether the hypothesis text (the examinee's answer) is correct. Text knowledge reasoning has important application value in fields such as subjective question grading, professional knowledge question answering, and knowledge reasoning.
In professional-knowledge text reasoning, the number of knowledge point categories is huge, knowledge points are described in many forms, the content and form of premise texts are continuously updated, and hypothesis texts vary widely because they are closely tied to each individual's professional knowledge level and expressive ability. As a result, samples describing the same knowledge point are highly confusable and hard to identify; rarely used knowledge points have few corresponding samples; and unpopular professional knowledge points lack labeled samples altogether. Facing continuously growing knowledge point sample data, especially knowledge points not covered by historical sample data, an intelligent model must address not only few-shot and noisy-sample challenges but also the continuous learning challenge, namely learning new knowledge points without forgetting existing knowledge, thereby improving the generalization ability and robustness of the model.
Continuous learning is introduced so that the text knowledge reasoning intelligent model can both handle new problems well and retain good performance on historical tasks. In the field of artificial intelligence, memory playback strategies are among the most effective continuous learning methods; see, for example, Wang, Hong, et al. "Sentence Embedding Alignment for Lifelong Relation Extraction", arXiv preprint arXiv:1903.02588, 2019.
Continuous learning is achieved by saving part of the samples of previous tasks to participate in subsequent training; the set formed by these samples is called the memory set, and the quality of the samples in the memory set determines the performance of the inference model on historical tasks.
For example, Chinese patent document CN114722892A provides a continuous learning method and device based on machine learning, in which a generator is trained on historical data and then used to generate a pseudo-sample set for each task as the memory set; with this method the quality of the generated samples is hard to guarantee, which affects the continuous learning effect.

Chinese patent document CN113688882A proposes a training method and device for a memory-enhanced continuous learning neural network model. Inspired by human brain memory playback, an expandable memory module is constructed with a simple data playback method that stores the mean and variance of the data, achieving a memory enhancement effect on the original task; however, this scheme considers only mode-representative samples of the dataset, ignoring sample difficulty and diversity.

Chinese patent document CN113590958A discloses a continuous learning method for a sequence recommendation model based on sample playback, which samples a small portion of representative samples according to an item-category balancing policy to generate the memory set; this approach does not consider sample difficulty and difference. In summary, existing work can hardly meet the continuous learning requirements of a text knowledge reasoning model.
To sum up, the problems of the prior art include: samples describing the same knowledge point vary in pattern and are uneven in quality; knowledge point categories are incompletely covered and the number of samples per knowledge point category is unbalanced; and, when selecting samples to add to the memory set, the selected samples describing the same knowledge point are highly repetitive in form or quality.
Disclosure of Invention
To address the defects of the prior art, the invention discloses a sample selection method for continuous learning of a text knowledge reasoning model.
The invention also discloses a device for realizing the sample selection method.
To address these problems, the invention proposes a representativeness index for the varied patterns and uneven quality of samples describing the same knowledge point; a balance index for the problems of knowledge point category coverage and the unbalanced number of samples per knowledge point category; and, to prevent high repetitiveness in the form or quality of memory-set samples describing the same knowledge point, a difference index. On this basis, multiple selection strategies and sample selection techniques that balance sample quality and sample feature distribution are provided, improving the model performance and robustness of continuous learning for professional-knowledge text reasoning, with theoretical significance for other text understanding tasks.
Interpretation of technical terms
1. Professional knowledge: texts describing theories, technologies, concepts, facts, and the like in professional fields such as finance, law, and accounting, as distinguished from general knowledge and common sense.
2. Professional knowledge point: the minimum constituent unit of professional knowledge, written in a normalized text description form, hereinafter referred to as a knowledge point.
3. Premise text: a domain knowledge point, or a description of facts related to a domain knowledge point, denoted p. The same knowledge point may be described by multiple premise texts.
4. Hypothesis text: a text describing how different people understand a professional knowledge point, denoted h. One premise text may have multiple corresponding hypothesis texts.
5. Task: in continuous model learning, learning proceeds over a series of tasks that have a temporal order, with the model learning on each task separately.
6. Dataset: each task has its own dataset; each sample of the dataset is a triple (p, h, y), where p is the premise text, h is the hypothesis text, and y ∈ {0, 1, 2}, with 0, 1, 2 respectively indicating that the sample label is entailment, contradiction, or neutral. The relationship of tasks to datasets is shown in FIG. 1.
7. Sentence-BERT model: the model described in Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084, 2019.
8. Sentence-Transformers: a code implementation of the Sentence-BERT model using the PyTorch framework for Python.
The detailed technical scheme of the invention is as follows:
a sample selection method for continuous learning of a text knowledge reasoning model is characterized by comprising the following steps: historical task sample selection and current task sample selection;
(1) The historical task sample selection comprises the following steps:
acquiring center vectors and selecting samples, wherein a center vector refers to the center vector of all samples describing the same knowledge point, and sample selection picks suitable samples to add to the memory sample set M;
when the center vectors are obtained:

for the dataset D_k of samples of the form (p, h, y) describing knowledge point k, the surface feature center vector c_k^sur and the implicit feature center vector c_k^imp are calculated using formulas (I) and (II), where f_sur(·) and f_imp(·) are, respectively, the functions that obtain the surface features and the implicit features of a text:

c_k^sur = (1/|D_k|) Σ_{x ∈ D_k} f_sur(x)    (I)

c_k^imp = (1/|D_k|) Σ_{x ∈ D_k} f_imp(x)    (II)

in formulas (I) and (II), one surface feature center vector and one implicit feature center vector are obtained for each knowledge point;

obtaining the surface features (lexical features) and implicit features (semantic features) of a text: the surface features are expressed using term frequency and inverse document frequency and are denoted f_sur(x); the implicit features are expressed using the Sentence-BERT vector of the encoded text and are denoted f_imp(x);
sample selection proceeds as follows:

(1-1) determining the number of samples selected to be added to the memory set:

formulas (III) and (IV) determine how many samples describing each knowledge point are selected into the memory set. Ordering the tasks, the current task is denoted the t-th task, and the sample size selected from the dataset of each of the t−1 historical tasks is B/t, as in formula (III), where B is the total sample size to be selected for model training, namely the sum of the memory set sample size and the sample size selected for the current task:

m_i = B / t,  i = 1, …, t−1    (III)

the number q_k of samples selected from the samples of the k-th knowledge point in the dataset of the i-th task is determined by formula (IV), such that the distribution of the number of samples extracted per knowledge point is consistent with the original dataset:

q_k = m_i · |D_{i,k}| / |D_i|    (IV)

where D_i denotes the i-th dataset and D_{i,k} denotes the dataset of samples describing the k-th knowledge point in the i-th dataset;
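A minimal sketch of the budget allocation reconstructed in formulas (III) and (IV): each task receives an equal share B/t, which is then split across knowledge points in proportion to their sample counts. Function and variable names are illustrative, not from the patent text.

```python
# Sketch of formulas (III)-(IV): per-task share B/t, then proportional
# per-knowledge-point quotas; rounding may drift quotas by +/-1.
from collections import Counter

def allocate_quotas(total_budget: int, task_index: int,
                    knowledge_labels: list[str]) -> dict[str, int]:
    """Split the per-task share of the budget across knowledge points
    in proportion to their sample counts in the original dataset."""
    per_task = total_budget // task_index            # m_i = B / t  (formula III)
    counts = Counter(knowledge_labels)               # |D_{i,k}| per knowledge point
    n = len(knowledge_labels)                        # |D_i|
    return {k: max(1, round(per_task * c / n)) for k, c in counts.items()}

# Example: a budget of 600 over 3 tasks gives each historical task 200 slots.
quotas = allocate_quotas(600, 3, ["k1"] * 70 + ["k2"] * 20 + ["k3"] * 10)
```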
(1-2) selecting samples:

the memory set is selected by measuring samples against the representativeness, difference, and balance indexes; when selecting, the samples are traversed using one of the following two schemes;

scheme (1): traverse in ascending order of the distance between the sample vector and the center vector, where the vectors include the surface feature vector and the implicit feature vector;

scheme (2): traverse each sample randomly with equal probability;

if a traversed sample satisfies representativeness, difference, and balance, it is added to the memory set;

otherwise, the traversed sample is discarded and the next traversal continues, until the number of selected samples meets the requirement;
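The traversal-and-check loop of step (1-2) can be sketched as below; the three predicate functions stand in for the representativeness, difference, and balance checks defined in the later sections, and all names are assumptions.

```python
# Sketch of the memory-set selection loop: traverse candidates by distance to
# the center vector (scheme 1) or in random order (scheme 2), keeping samples
# that pass all three index checks until the quota is reached.
import numpy as np

def select_memory_samples(candidates, center, quota,
                          is_representative, differs_from, keeps_balance,
                          scheme="distance"):
    vecs = np.stack([c["vector"] for c in candidates])
    if scheme == "distance":                       # scheme (1)
        order = np.argsort(np.linalg.norm(vecs - center, axis=1))
    else:                                          # scheme (2)
        order = np.random.permutation(len(candidates))
    memory = []
    for i in order:
        x = candidates[i]
        if is_representative(x) and differs_from(x, memory) and keeps_balance(x, memory):
            memory.append(x)                       # passes all three indexes
        if len(memory) >= quota:
            break
    return memory
```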
(2) The current task sample selection comprises the following steps:

sample representativeness analysis, sample difficulty analysis, and sample sampling; the flow is as follows;

to reduce training cost, only a very small number of samples should be selected for training on the current task, but a small number of samples can hardly represent the characteristics of the overall sample population; representative and difficult samples therefore need to be extracted from the current task dataset for training the text knowledge reasoning model. Since the screened samples are used to fine-tune the model, they must, on the one hand, represent the overall samples as well as possible under the limited-sample constraint and, on the other hand, include difficult samples, which carry more information from which the model can benefit; the invention therefore combines representativeness and difficulty to screen samples for the current task;
(2-1) sample representativeness analysis in the current task sample selection, comprising:

for the candidate sample set D_{t,k} of knowledge point k in the dataset D_t of the current (t-th) task, clustering is performed on the set of hypothesis-text surface feature vectors {f_sur(h)} of the samples, with the number of clusters specified as n; n can be determined from the number of knowledge points of the previous task, i.e., a similar number is chosen; this yields n clusters u_1, …, u_n. Let q_k be the number of samples to extract for knowledge point k in this task. For each cluster u_j, the in-cluster sample variance var(u_j) and the in-cluster sample count |u_j| are calculated to analyze sample representativeness: in a cluster with large variance and a large sample count, each individual sample is less representative of the cluster, so more samples must be drawn from that cluster to maintain its representativeness. The number q_{k,j} of samples drawn from cluster u_j is determined according to formula (V), where q_{k,j} means the number of samples selected from the j-th cluster for knowledge point k in the t-th task:

q_{k,j} = q_k · var(u_j) · |u_j| / Σ_{j'} var(u_{j'}) · |u_{j'}|    (V)
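A sketch of the cluster-level quota allocation of step (2-1), assuming k-means over the hypothesis-text surface vectors and the variance-times-size weighting reconstructed as formula (V); the helper name is illustrative.

```python
# Sketch of step (2-1): cluster surface vectors, then allocate the
# knowledge-point quota q_k across clusters by variance * size.
import numpy as np
from sklearn.cluster import KMeans

def cluster_quotas(surface_vectors: np.ndarray, n_clusters: int, q_k: int):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(surface_vectors)
    weights = []
    for j in range(n_clusters):
        members = surface_vectors[km.labels_ == j]
        var_j = members.var(axis=0).sum()          # in-cluster sample variance
        weights.append(var_j * len(members))       # large variance & size -> more draws
    weights = np.asarray(weights) / sum(weights)
    quotas = np.round(weights * q_k).astype(int)   # q_{k,j}; rounding may drift by 1
    return km.labels_, quotas
```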
(2-2) sample difficulty analysis in the current task sample selection, comprising:

using a pre-trained professional text reasoning model F, the premise text p and the hypothesis text h of a sample x are input for inference prediction, obtaining the predictive probability distribution P(c | p, h) over the category set C = {0, 1, 2};

let c* denote the category label of maximum probability and c' the category label of second-largest probability; formula (VII) calculates, for sample x, the difference between the largest and second-largest output probabilities in the inference model's predicted probability distribution, which measures the difficulty d(x) of sample x:

d(x) = P(c* | p, h) − P(c' | p, h)    (VII)

the smaller d(x) is, the less confident the inference model is in its prediction, and therefore the more difficult the sample; here P(c* | p, h) denotes the probability with which the professional text reasoning model F predicts sample category c*, and P(c' | p, h) denotes the probability with which it predicts sample category c';
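A sketch of the difficulty score of formula (VII), the margin between the model's top two class probabilities; the logits interface is an assumption, and any three-way inference classifier returning raw scores would do.

```python
# Sketch of formula (VII): d(x) = P(c*) - P(c'); smaller margin = harder sample.
import torch

def difficulty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, 3) raw scores for entailment/contradiction/neutral."""
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values
    return top2[:, 0] - top2[:, 1]          # top-1 minus top-2 probability
```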
(2-3) sample sampling in the current task sample selection, comprising:

(2-3-1) performing the sample representativeness analysis described in step (2-1) on the candidate sample set D_{t,k} of knowledge point k in the current task dataset D_t, obtaining the number q_{k,j} of samples to draw from each cluster u_j, so as to maintain the representativeness of the screened sample set;

(2-3-2) after the sample difficulty analysis described in step (2-2), the quantified difficulty value d(x) of each sample x is calculated; for each cluster u_j, the q_{k,j} most difficult samples are sampled from it, i.e., the q_{k,j} samples with the smallest d(x) values are chosen in ascending order and added to the small sample set; carrying out the above sampling for all clusters u_1, …, u_n completes the sampling for knowledge point k; all knowledge points in the current task complete the data screening on the current task through this sampling process, and the number of training samples finally screened for the current task is the sum of the per-knowledge-point quotas, Σ_k q_k.
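Tying steps (2-1) to (2-3) together for one knowledge point, reusing the illustrative cluster_quotas and difficulty helpers sketched above:

```python
# Sketch of step (2-3): cluster, score difficulty, take the hardest per cluster.
import numpy as np

def sample_current_task(surface_vectors, logits, n_clusters, q_k):
    """Return indices of the screened training samples for one knowledge point."""
    labels, quotas = cluster_quotas(surface_vectors, n_clusters, q_k)
    d = difficulty(logits).detach().cpu().numpy()   # smaller d(x) = harder
    chosen = []
    for j in range(n_clusters):
        members = np.where(labels == j)[0]
        hardest = members[np.argsort(d[members])][: quotas[j]]  # ascending d(x)
        chosen.extend(hardest.tolist())
    return chosen
```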
According to the invention, in step (1), the specific method for obtaining the surface features of a text is as follows:

the text is segmented using the coarse-grained tokenizer of HanLP; common Chinese stop words are removed from the segmentation result, and words that appear only once are filtered out, yielding a dictionary over all texts; finally, the TF-IDF feature vector of each text is calculated as the surface features of the sample.
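A minimal sketch of this surface-feature pipeline using sklearn's TfidfVectorizer; the tokenizer below is a stand-in for HanLP's coarse-grained tokenizer, and the stop-word list is illustrative.

```python
# Sketch of the surface-feature pipeline: tokenize, drop stop words and
# words appearing only once (min_df=2), then compute TF-IDF vectors f_sur(x).
from sklearn.feature_extraction.text import TfidfVectorizer

def coarse_tokenize(text: str) -> list[str]:
    # Placeholder for a loaded HanLP coarse-grained tokenizer; whitespace
    # splitting keeps the sketch self-contained.
    return text.split()

stop_words = ["的", "了", "是"]                    # illustrative Chinese stop words
vectorizer = TfidfVectorizer(tokenizer=coarse_tokenize, min_df=2,
                             stop_words=stop_words, token_pattern=None)
# surface = vectorizer.fit_transform(corpus)       # f_sur(x) for each text x
```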
According to the invention, in step (1), the specific method for obtaining the implicit features of a text is preferably as follows:

a concrete implementation of the Sentence-BERT model, Sentence-Transformers, is used; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the encoded Sentence-BERT feature vector represents the implicit features of the text.
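A sketch of the implicit-feature encoding with Sentence-Transformers, loading the checkpoint named above; the example text is illustrative.

```python
# Sketch of the implicit-feature step: encode texts into Sentence-BERT vectors.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
texts = ["有限责任公司的股东资格可以继承。"]        # illustrative hypothesis text
implicit = encoder.encode(texts)                  # f_imp(x): one vector per text
```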
Preferably, in step (1-2), the method for selecting the memory set by measuring samples against the representativeness, difference, and balance indexes comprises:

for a series of tasks T_1, …, T_t, each with a corresponding dataset, where T_t denotes the task currently to be learned and D_t its corresponding dataset, each sample in a dataset being a triple (p, h, y): when the text knowledge reasoning model learns on the current task, memory playback is performed with the memory set M = ∪_i M_i, where ∪ is the set union operation whose scope is all tasks before the current task, i.e., i = 1, …, t−1; M_i denotes the set of samples from dataset D_i added to the memory set, and M is the memory set. Each set M_i satisfies representativeness, difference, and balance;

the representativeness, measured using the Local Outlier Factor (LOF): because individuals differ in professional knowledge level and expression, samples describing the same knowledge point are uneven in quality and varied in descriptive form, so the samples added to the memory set should represent the different qualities and forms of all samples describing the corresponding knowledge point;

the difference: for samples x_i, x_j, the difference between them is measured by the L2 distance of their surface feature vectors and implicit feature vectors; to achieve the model robustness goal, diverse samples are selected into the memory set, i.e., there should be differences between sample features;

the balance means that all knowledge points described in the original dataset are covered in the memory set, and the number of samples describing the same knowledge point in the original dataset is balanced with the number describing it in the memory set.
According to a preferred embodiment of the invention, the specific method for measuring the representativeness using the Local Outlier Factor (LOF) is as follows:

formula (VIII) is used to calculate the distance between two vectors:

d(v_i, v_j) = ‖v_i − v_j‖_2    (VIII)

where v_i, v_j are two surface feature vectors or two implicit feature vectors;

N_r(v) denotes the set of all vectors within the r-distance of vector v, where the r-distance of vector v means the distance from v to its r-th nearest neighboring vector; from these neighborhoods, the local reachability density and the LOF value of each vector are computed;
when the LOF value of a vector is greater than 1, the local reachability density of the current vector is smaller than the local reachability density of surrounding vectors, and the larger the value, the greater the outlier degree of the current sample;

when the LOF value of a vector is less than or equal to 1, the local reachability density of the current vector is larger than the local reachability density of surrounding vectors, and the smaller the value, the greater the aggregation degree of the current sample;
the representativeness Rep(x) of a sample is obtained by combining the local outlier condition of the sample in the surface feature space and the implicit feature space, as in formula (XII); the smaller the LOF values, the larger Rep(x), and the more representative the sample: from the above description of LOF values less than or equal to 1 and greater than 1, the larger the LOF value, the greater the outlier degree of the corresponding vector and the more likely the vector is abnormal; the smaller the LOF value, the smaller the outlier degree of the corresponding vector and the more likely the vector is normal;

Rep(x) = α / LOF(f_sur(x)) + (1 − α) / LOF(f_imp(x))    (XII)

where α is an adjustable parameter indicating the relative importance of the surface features: if the surface features are important, α is increased; if the implicit features are important, α is decreased; the default value is 0.5. The division in formula (XII) keeps Rep(x) within a similar range regardless of the sample distribution.
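A sketch of the representativeness score, computing LOF in both feature spaces with scikit-learn; the α/LOF mixing form follows the reconstruction of formula (XII) above and is an assumption.

```python
# Sketch of Rep(x): smaller LOF in either feature space raises the score.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def representativeness(surface: np.ndarray, implicit: np.ndarray,
                       alpha: float = 0.5, r: int = 20) -> np.ndarray:
    lof_sur = LocalOutlierFactor(n_neighbors=r).fit(surface)
    lof_imp = LocalOutlierFactor(n_neighbors=r).fit(implicit)
    # negative_outlier_factor_ stores -LOF, so negate to recover LOF values
    s = -lof_sur.negative_outlier_factor_
    i = -lof_imp.negative_outlier_factor_
    return alpha / s + (1.0 - alpha) / i           # formula (XII), reconstructed
```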
According to a preferred embodiment of the invention, the difference between samples x_i, x_j is measured using the L2 distance of their surface feature vectors and implicit feature vectors; the specific method is as follows:

as shown in formulas (XIII) and (XIV), where ε_sur and ε_imp are difference thresholds, adjustable parameters that default to the average of the distances between all samples:

‖f_sur(x_i) − f_sur(x_j)‖_2 > ε_sur    (XIII)

‖f_imp(x_i) − f_imp(x_j)‖_2 > ε_imp    (XIV)

a candidate sample is selected into the memory set when it satisfies formulas (XIII) and (XIV) against the samples already in the memory set; this is the step that determines whether there is a difference between the candidate sample and the samples in the memory set.
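A sketch of the difference check of formulas (XIII) and (XIV): a candidate must exceed both thresholds against every sample already in the memory set; all names are illustrative.

```python
# Sketch of the difference check in both feature spaces.
import numpy as np

def differs_from(cand_sur, cand_imp, mem_sur, mem_imp,
                 eps_sur: float, eps_imp: float) -> bool:
    if len(mem_sur) == 0:
        return True                                  # empty memory set: accept
    d_sur = np.linalg.norm(np.asarray(mem_sur) - cand_sur, axis=1)
    d_imp = np.linalg.norm(np.asarray(mem_imp) - cand_imp, axis=1)
    return bool((d_sur > eps_sur).all() and (d_imp > eps_imp).all())
```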
According to a preferred embodiment of the invention, the balance requires that the feature distribution of the sample set M_i selected to be added to the memory set approximates the feature distribution of the original dataset D_i:

P(x ∈ D_{i,k} | x ∈ M_i; θ_M) ≈ P(x ∈ D_{i,k} | x ∈ D_i; θ_D)  for every knowledge point k ∈ K

where P(·; θ_M) is the memory set sample probability distribution, P(·; θ_D) is the original dataset sample probability distribution, and θ_D, θ_M are, respectively, the parameters of the original dataset sample probability distribution and the memory set sample probability distribution (a probability distribution P is accompanied by the specific parameter θ that determines it, hence the notation P(·; θ)); ∈ is the basic membership operation of set operations; x denotes a sample; D_{i,k} denotes the samples describing knowledge point k; K is the set of knowledge points. After a sample is added, whether the formula holds is verified; if it holds, the definition of balance is satisfied.
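A sketch of a balance check consistent with the definition above: after tentatively adding a sample, per-knowledge-point proportions in the memory set are compared with those of the original dataset; the tolerance is an illustrative assumption.

```python
# Sketch of the balance check: memory-set label proportions should track the
# original dataset's proportions within a tolerance.
from collections import Counter

def keeps_balance(mem_labels: list[str], orig_labels: list[str],
                  tol: float = 0.1) -> bool:
    mem, orig = Counter(mem_labels), Counter(orig_labels)
    n_mem, n_orig = len(mem_labels), len(orig_labels)
    for k, c in orig.items():
        p_orig = c / n_orig
        p_mem = mem.get(k, 0) / max(n_mem, 1)
        if abs(p_mem - p_orig) > tol:
            return False
    return True
```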
In the above process, the hyperparameters (such as α, ε_sur, and ε_imp) can be used with their default values or customized to meet actual business requirements. The invention can automatically select historical samples to add to the memory set according to the established algorithm, for subsequent training of the natural language reasoning model oriented to professional knowledge points.
An apparatus for implementing the sample selection method, comprising: a center vector calculating module, a sample selection module, and a training module;

the center vector calculating module is used for calculating the center vectors of the surface features and the implicit features according to the label information of the samples, for use in subsequent sample selection;

the sample selection module selects suitable samples to add to the memory set according to the characteristics the samples must satisfy, using a selectable sample selection strategy, and finally obtains the complete memory set;

and the training module is used for assisting the current task training with the complete memory set.
A computer device for implementing the sample selection method, characterized by comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following:

in the use stage, the user specifies the tasks, the precedence relationship among tasks, the strategy, and optional hyperparameters; historical samples are then selected by the sample selection method and added to the memory set, which is used for subsequent professional text reasoning model training, i.e., training the natural language reasoning model for professional knowledge points.
The invention has the technical advantages that:
compared with the prior art that the representative sample is selected based on the clustering center, the continuous learning sample selection method for the text knowledge reasoning model can be better adapted to complex text reasoning scenes, and a small number of samples are effectively used for approximating the distribution of the original samples, so that the model memorizes the learned knowledge on the historical task.
According to the invention, the text knowledge reasoning model can be finely tuned and optimized according to the properties and the screened samples of the high-quality memory set sample provided by the practical problem, the model can be better helped to memorize the existing task, and meanwhile, the model is helped to train the current task, so that the robustness of the model in practical use is effectively increased.
In addition, the invention can also be used on other similar tasks in the field of natural language processing, such as continuous learning tasks based on scene memory playback, such as knowledge questions and answers, text classification and the like.
Drawings
FIG. 1 is a schematic diagram of task versus dataset for text-oriented knowledge reasoning in accordance with the present invention;
FIG. 2 is a flow chart of a method for selecting a continuous learning history sample based on text knowledge reasoning;
FIG. 3 is a schematic diagram of a sample selection flow of a text knowledge reasoning-oriented continuous learning history sample selection method in accordance with the present invention;
fig. 4 is a flow chart of a method for continuously learning a current sample selection, which is directed to text knowledge reasoning in the present invention.
Detailed Description
The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.
EXAMPLE 1,

As shown in FIG. 1 and FIG. 2, a sample selection method for continuous learning of a text knowledge reasoning model comprises historical task sample selection and current task sample selection, carried out as described in the detailed technical scheme above.

(1) The historical task sample selection comprises obtaining the center vectors and selecting samples: the center vector of all samples describing the same knowledge point is obtained by formulas (I) and (II), and suitable samples are selected and added to the memory sample set M, as shown in FIG. 2. Sample selection comprises step (1-1), determining the number of samples added to the memory set by formulas (III) and (IV), and step (1-2), selecting samples by measuring them against the representativeness, difference, and balance indexes; the flow is shown in FIG. 3.

(2) The current task sample selection comprises sample representativeness analysis (step (2-1)), sample difficulty analysis (step (2-2)), and sample sampling (step (2-3)), performed as described above; the flow is shown in FIG. 4.
EXAMPLE 2,

The sample selection method of Embodiment 1, wherein in step (1) the surface features of a text are obtained by the HanLP coarse-grained tokenization and TF-IDF procedure described above, and the implicit features are obtained by encoding the text with Sentence-Transformers, loading the pre-trained model paraphrase-multilingual-mpnet-base-v2.
EXAMPLE 3,

The sample selection method of Embodiments 1 and 2, wherein in step (1-2) the memory set is selected by measuring samples against the representativeness, difference, and balance indexes exactly as in the detailed technical scheme above: representativeness is measured with the Local Outlier Factor per formulas (VIII)-(XII); difference is measured with the L2 distances of the surface and implicit feature vectors against the thresholds of formulas (XIII) and (XIV); and balance requires the per-knowledge-point distribution of the memory set to approximate that of the original dataset. The hyperparameters may take their default values or be customized to meet actual business requirements.
EXAMPLE 5,

An apparatus for implementing the sample selection method of Embodiments 1-4, comprising the center vector calculating module, the sample selection module, and the training module described above.
EXAMPLE 6,

A computer apparatus implementing the sample selection method of Embodiments 1-4, comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the use stage described above: the user specifies the tasks, their precedence relationship, the strategy, and optional hyperparameters; historical samples are then selected, added to the memory set, and used for subsequent professional text reasoning model training.
Claims (8)
1. A sample selection method for continuous learning of a text knowledge reasoning model is characterized by comprising the following steps: historical task sample selection and current task sample selection;
(1) The historical task sample selection comprises the following steps: obtaining center vectors and selecting samples;

when the center vectors are obtained: for the dataset D_k of samples of the form (p, h, y) describing knowledge point k, the surface feature center vector c_k^sur and the implicit feature center vector c_k^imp are calculated using formulas (I) and (II), where f_sur(·) and f_imp(·) are, respectively, the functions that obtain the surface features and the implicit features of a text:

c_k^sur = (1/|D_k|) Σ_{x ∈ D_k} f_sur(x)    (I)

c_k^imp = (1/|D_k|) Σ_{x ∈ D_k} f_imp(x)    (II)

in formulas (I) and (II), one surface feature center vector and one implicit feature center vector are obtained for each knowledge point;

obtaining the surface features and implicit features of a text: the surface features are expressed using term frequency and inverse document frequency and are denoted f_sur(x); the implicit features are expressed using the Sentence-BERT vector of the encoded text and are denoted f_imp(x);
the sample selecting process comprises:

(1-1) determining the number of samples selected to be added to the memory set;

(1-2) selecting samples, namely selecting the memory set by measuring samples against the representativeness, difference, and balance indexes, and traversing the samples with one of the following two schemes when selecting,

scheme (1): traversing in ascending order of the distance between the sample vector and the center vector, the vectors including the surface feature vector and the implicit feature vector;

scheme (2): traversing each sample randomly with equal probability;

if the traversed sample satisfies representativeness, difference, and balance, it is added to the memory set;

otherwise, the traversed sample is discarded and the next traversal continues, until the number of selected samples meets the requirement;
(2) The current task sample selection comprises the following steps: sample representativeness analysis, sample difficulty analysis, and sample sampling;

(2-1) sample representativeness analysis in the current task sample selection;

(2-2) sample difficulty analysis in the current task sample selection;

(2-3) sample sampling in the current task sample selection;
in step (1):

(1-1) determining the number of samples selected to be added to the memory set:

ordering the tasks, the current task is denoted the t-th task, and the sample size selected from the dataset of each of the t−1 historical tasks is B/t, as in formula (III), where B is the total sample size to be selected for model training, namely the sum of the memory set sample size and the sample size selected for the current task:

m_i = B / t,  i = 1, …, t−1    (III)

the number q_k of samples selected from the samples of the k-th knowledge point in the dataset of the i-th task is determined by formula (IV), such that the distribution of the number of samples extracted per knowledge point is consistent with the original dataset:

q_k = m_i · |D_{i,k}| / |D_i|    (IV)

where D_i denotes the i-th dataset and D_{i,k} denotes the dataset of samples describing the k-th knowledge point in the i-th dataset;
in step (2), the sample representativeness analysis (2-1) in the current task sample selection comprises:

for the candidate sample set D_{t,k} of knowledge point k in the current task dataset D_t, clustering is performed on the set of hypothesis-text surface feature vectors of the samples, with the number of clusters specified as n; according to formula (VI), n clusters u_1, …, u_n are obtained; the number of samples to be extracted for knowledge point k in this task is q_k; for each cluster u_j, the in-cluster sample variance var(u_j) and the in-cluster sample count |u_j| are calculated to analyze sample representativeness: the number q_{k,j} of samples drawn from cluster u_j is determined according to formula (V), q_{k,j} meaning the number of samples selected from the j-th cluster for knowledge point k in the t-th task:

q_{k,j} = q_k · var(u_j) · |u_j| / Σ_{j'} var(u_{j'}) · |u_{j'}|    (V)
(2-2) the sample difficulty analysis in the current task sample selection comprises:

using a pre-trained professional text reasoning model F, the premise text p and the hypothesis text h of a sample x are input for inference prediction, obtaining the predictive probability distribution over the category set C;

with c* denoting the category label of maximum probability and c' the category label of second-largest probability, formula (VII) calculates, for sample x, the difference between the largest and second-largest output probabilities in the inference model's predicted probability distribution, which measures the difficulty d(x) of sample x:

d(x) = P(c* | p, h) − P(c' | p, h)    (VII)

where P(c* | p, h) denotes the probability with which the professional text reasoning model F predicts sample category c*, and P(c' | p, h) denotes the probability with which it predicts sample category c';
(2-3) the sample sampling in the current task sample selection comprises:

(2-3-1) performing the sample representativeness analysis described in step (2-1) on the candidate sample set D_{t,k} of knowledge point k in the current task dataset D_t, obtaining the number q_{k,j} of samples to draw from each cluster u_j, so as to maintain the representativeness of the screened sample set;

(2-3-2) after the sample difficulty analysis described in step (2-2), the quantified difficulty value d(x) of each sample x is calculated; for each cluster u_j, the q_{k,j} most difficult samples are sampled from it, i.e., the q_{k,j} samples with the smallest d(x) values are chosen in ascending order and added to the small sample set; carrying out the above sampling for all clusters u_1, …, u_n completes the sampling for knowledge point k; all knowledge points in the current task complete the data screening on the current task through this sampling process, and the number of training samples finally screened for the current task is the sum of the per-knowledge-point quotas, Σ_k q_k;
in step (1-2), the method for selecting the memory set by measuring samples against the representativeness, difference, and balance indexes comprises:

the representativeness is measured using the local outlier factor;

the difference, for samples x_i, x_j, is measured by the L2 distance between their surface feature vectors and between their implicit feature vectors;

the balance means that all knowledge points described in the original dataset are covered in the memory set, and the number of samples describing the same knowledge point in the original dataset is balanced with the number describing it in the memory set.
2. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein in step (1) the specific method for obtaining the surface features of a text is as follows:

the text is segmented using the coarse-grained tokenizer of HanLP; common Chinese stop words are removed from the segmentation result, and words that appear only once are filtered out, yielding a dictionary over all texts; finally, the TF-IDF feature vector of each text is calculated as the surface features of the sample.
3. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein in step (1) the specific method for obtaining the implicit features of a text is as follows:

a concrete implementation of the Sentence-BERT model, Sentence-Transformers, is used; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the encoded Sentence-BERT feature vector represents the implicit features of the text.
4. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein the specific method for measuring the representativeness using the local outlier factor is:

formula (VIII) is used to calculate the distance between two vectors:

d(v_i, v_j) = ‖v_i − v_j‖_2    (VIII)

where v_i, v_j are two surface feature vectors or two implicit feature vectors;

N_r(v) denotes the set of all vectors within the r-distance of vector v, where the r-distance of vector v means the distance from v to its r-th nearest neighboring vector;

when the LOF value of a vector is greater than 1, the local reachability density of the current vector is smaller than the local reachability density of surrounding vectors, and the larger the value, the greater the outlier degree of the current sample;

when the LOF value of a vector is less than or equal to 1, the local reachability density of the current vector is larger than the local reachability density of surrounding vectors, and the smaller the value, the greater the aggregation degree of the current sample;

the representativeness Rep(x) of the sample is obtained by combining the local outlier condition of the sample in the surface feature space and the implicit feature space; the smaller the LOF values, the larger Rep(x);
5. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein the difference between samples x_i, x_j is measured using the L2 distance of their surface feature vectors and implicit feature vectors; the specific method is as follows:

‖f_sur(x_i) − f_sur(x_j)‖_2 > ε_sur    (XIII)

‖f_imp(x_i) − f_imp(x_j)‖_2 > ε_imp    (XIV)

a candidate sample that satisfies formulas (XIII) and (XIV) is entered into the selected memory set.
6. The sample selection method for continuous learning of a text-oriented knowledge reasoning model according to claim 1, wherein the balance requires that the feature distribution of the sample set $\mathcal{M}$ added to the memory set approximates the feature distribution of the original data set $\mathcal{D}$:

$P(\mathcal{M}; \theta_{\mathcal{M}}) \approx P(\mathcal{D}; \theta_{\mathcal{D}})$

where $P(\mathcal{M}; \theta_{\mathcal{M}})$ is the sample probability distribution of the memory set, abbreviated $P(\mathcal{M})$ and treated as a whole; $P(\mathcal{D}; \theta_{\mathcal{D}})$ is the sample probability distribution of the original data set; $\theta_{\mathcal{D}}$ and $\theta_{\mathcal{M}}$ are the parameters of the original-data-set and memory-set sample probability distributions respectively, every probability distribution $P(\cdot)$ being accompanied by a parameter $\theta$ that determines its specific form and therefore written $P(\cdot; \theta)$; $\cup$ is the basic operation of set operations; $x$ denotes a sample; $x_c$ denotes a sample describing knowledge point $c$; and $\mathcal{C}$ is the set of knowledge points.
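One way to satisfy this constraint, sketched under the assumption that matching the per-knowledge-point sample proportions of the original data set keeps the two distributions approximate; the proportional-quota rule is illustrative, not the patent's stated procedure:

```python
from collections import Counter

def balanced_quota(knowledge_points: list[str], budget: int) -> dict[str, int]:
    """Allocate the memory-set budget across knowledge points in proportion
    to their share of the original data set (rounding may shift the total
    by a sample or two, which a real implementation would reconcile)."""
    counts = Counter(knowledge_points)  # samples per knowledge point in D
    total = sum(counts.values())
    return {kp: max(1, round(budget * n / total)) for kp, n in counts.items()}

# Toy usage: a 10-sample memory set over three knowledge points.
print(balanced_quota(["a"] * 50 + ["b"] * 30 + ["c"] * 20, budget=10))
```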
7. An apparatus for implementing the sample selection method of any one of claims 1-6, comprising a center vector calculation module, a sample selection module and a training module;
the center vector calculation module is used for calculating the center vectors of the surface features and the implicit features according to the label information of the samples;
the sample selection module selects suitable samples and adds them to the memory set, finally obtaining a complete memory set;
and the training module uses the complete memory set to assist the training of the current task.
8. A computer device implementing the sample selection method of any one of claims 1-6, characterized in that it comprises a processor, a storage device, and a computer program stored on the storage device and executable on the processor, and that when executing the computer program the processor implements the following:
in the use stage, the user specifies the tasks, the precedence relationship among the tasks, the strategy and optional hyper-parameters; historical samples are then selected by the sample selection method and added to the memory set, and the memory set is used for the subsequent training of the professional text reasoning model, i.e. training the natural language reasoning model on professional knowledge points.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310107542.8A CN115829036B (en) | 2023-02-14 | 2023-02-14 | Sample selection method and device for text knowledge reasoning model continuous learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115829036A CN115829036A (en) | 2023-03-21 |
CN115829036B (en) | 2023-05-05
Family
ID=85521149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310107542.8A Active CN115829036B (en) | 2023-02-14 | 2023-02-14 | Sample selection method and device for text knowledge reasoning model continuous learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115829036B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021022572A1 (en) * | 2019-08-07 | 2021-02-11 | 南京智谷人工智能研究院有限公司 | Active sampling method based on meta-learning |
CN112966115A (en) * | 2021-05-18 | 2021-06-15 | 东南大学 | Active learning event extraction method based on memory loss prediction and delay training |
WO2021184311A1 (en) * | 2020-03-19 | 2021-09-23 | 中山大学 | Method and apparatus for automatically generating inference questions and answers |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130097103A1 (en) * | 2011-10-14 | 2013-04-18 | International Business Machines Corporation | Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set |
US9477654B2 (en) * | 2014-04-01 | 2016-10-25 | Microsoft Corporation | Convolutional latent semantic models and their applications |
CN112347268B (en) * | 2020-11-06 | 2024-03-19 | 华中科技大学 | Text-enhanced knowledge-graph combined representation learning method and device |
CN115563315A (en) * | 2022-12-01 | 2023-01-03 | 东南大学 | Active complex relation extraction method for continuous few-sample learning |
CN115618045B (en) * | 2022-12-16 | 2023-03-14 | 华南理工大学 | Visual question answering method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | Interactive genetic algorithms with large population and semi-supervised learning | |
CN113128369B (en) | Lightweight network facial expression recognition method fusing balance loss | |
CN109993236A (en) | Few-shot Manchu language matching method based on one-shot Siamese convolutional neural networks | |
CN110826639B (en) | Zero sample image classification method trained by full data | |
CN109858015A (en) | Semantic similarity calculation method and device based on CTW and KM algorithms | |
Kamada et al. | An adaptive learning method of restricted Boltzmann machine by neuron generation and annihilation algorithm | |
Callan et al. | Self-organizing map for the classification of normal and disordered female voices | |
CN110349597A (en) | Speech detection method and device | |
CN116821698B (en) | Wheat scab spore detection method based on semi-supervised learning | |
CN106448660B (en) | Natural language fuzzy boundary determination method incorporating big data analysis | |
CN116805533A (en) | Cerebral hemorrhage operation risk prediction system based on data collection and simulation | |
Wang et al. | Soft focal loss: Evaluating sample quality for dense object detection | |
CN115270752A (en) | Template sentence evaluation method based on multilevel comparison learning | |
CN113330462A (en) | Neural network training using soft nearest neighbor loss | |
Liu et al. | Audio and video bimodal emotion recognition in social networks based on improved alexnet network and attention mechanism. | |
CN115829036B (en) | Sample selection method and device for text knowledge reasoning model continuous learning | |
Zhan | A convolutional network-based intelligent evaluation algorithm for the quality of spoken English pronunciation | |
Lee et al. | Learning non-homogenous textures and the unlearning problem with application to drusen detection in retinal images | |
Shukla et al. | A novel stochastic deep conviction network for emotion recognition in speech signal | |
Wang et al. | Intelligent radar HRRP target recognition based on CNN-BERT model | |
Gumelar et al. | Transformer-CNN automatic hyperparameter tuning for speech emotion recognition | |
Perez et al. | Face Patches Designed through Neuroevolution for Face Recognition with Large Pose Variation | |
CN111523649A (en) | Method and device for preprocessing data aiming at business model | |
Zhou et al. | Research on intelligent diagnosis algorithm of diseases based on machine learning | |
CN116824237A (en) | Image recognition and classification method based on two-stage active learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||