CN115829036B - Sample selection method and device for text knowledge reasoning model continuous learning - Google Patents


Publication number
CN115829036B
Authority
CN
China
Prior art keywords
sample
samples
vector
text
knowledge
Prior art date
Legal status
Active
Application number
CN202310107542.8A
Other languages
Chinese (zh)
Other versions
CN115829036A (en)
Inventor
孙宇清
杨磊稳
马磊
杨涛
袁峰
Current Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310107542.8A
Publication of CN115829036A
Application granted
Publication of CN115829036B
Legal status: Active

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A sample selection method and device for continuous learning of a text knowledge reasoning model, belonging to the technical field of natural language reasoning, comprises historical task sample selection and current task sample selection. The historical task sample selection comprises: determining the number of samples to be selected and added to the memory set; and selecting samples, i.e., building the memory set by measuring samples against representativeness, difference, and balance indexes, traversing the samples with one of two schemes during selection. The current task sample selection comprises: sample representativeness analysis, sample difficulty analysis, and sample sampling. Compared with prior art that selects representative samples based on cluster centers, the method adapts better to complex text reasoning scenarios and effectively uses a small number of samples to approximate the distribution of the original samples, so that the model retains the knowledge learned on historical tasks.

Description

Sample selection method and device for text knowledge reasoning model continuous learning
Technical Field
The invention relates to a sample selection method and device for continuous learning of a text knowledge reasoning model, and belongs to the technical field of natural language reasoning.
Background
The natural language reasoning task gives a premise text and a hypothesis text, and judges whether the hypothesis text is correct, incorrect, or unrelated, taking the premise text as the standard. Text knowledge reasoning is a special form of the natural language reasoning task in which the premise text is a knowledge point in a professional field, or a description of facts related to such a knowledge point, and the hypothesis text describes how different people understand that knowledge point. For example, in an economics examination the premise text is professional knowledge or a related factual description in the field of economic law, here the reference answer to a test question: "According to the rules of the company law legal system, when the natural-person shareholders of a limited liability company change due to inheritance and other shareholders claim to exercise the right of first refusal, the people's court does not support the claim, unless the articles of association provide otherwise or all shareholders have agreed otherwise." The hypothesis text is a person's understanding of the knowledge point, corresponding here to an examinee's answer, such as: "Qian's request to exercise the right of first refusal is not supported by the people's court; the equity of a shareholder of a limited liability company is inherited by the shareholder's heir." In this example, the text knowledge reasoning task is to judge, according to the premise text (the reference answer), whether the hypothesis text (the examinee's answer) is correct. Text knowledge reasoning has important application value in fields such as subjective-question grading, professional knowledge question answering, and knowledge reasoning.
In professional-knowledge text reasoning, the number of knowledge-point categories in a professional field is huge, the knowledge points are described in many different forms, the content and form of the premise texts are continuously updated, and the hypothesis texts vary widely because they are closely tied to each individual's level of professional knowledge and ability of expression. These properties of the premise and hypothesis texts make samples describing the same knowledge point highly confusable and hard to discriminate; knowledge points that are used infrequently have few corresponding samples, and unpopular professional knowledge points lack labeled samples altogether. Faced with continuously growing knowledge-point sample data, and in particular with knowledge points not covered by historical sample data, the intelligent model must not only cope with sample-level challenges such as few samples and noise, but also with the continuous learning challenge of learning new knowledge points without forgetting existing knowledge, so as to increase the generalization ability and robustness of the model.
Continuous learning is introduced so that the text knowledge reasoning model can both handle new problems well and still perform well on historical tasks. In the field of artificial intelligence, memory playback strategies are among the most effective continuous learning methods; see, for example, Wang, Hong et al., "Sentence Embedding Alignment for Lifelong Relation Extraction", arXiv preprint arXiv:1903.02588, 2019.
Continuous learning is achieved by saving part of the samples of previous tasks to participate in the next round of training; the set formed by these saved samples is called the memory set, and the quality of the samples in the memory set determines the performance of the reasoning model on historical tasks.
For example, Chinese patent document CN114722892A provides a continuous learning method and device based on machine learning in which a generator is trained on historical data and then used to generate a pseudo-sample set for each task as the memory set; the quality of the generated samples is hard to guarantee, which affects the continuous learning effect.
Chinese patent document CN113688882A proposes a memory-enhanced training method and device for a continuous learning neural network model. Inspired by memory playback in the human brain, it builds an expandable memory module through a simple data-playback method that stores the mean and variance of the data, achieving memory enhancement for the original task; however, this scheme only considers samples representative of the modes of the data set and ignores sample difficulty and diversity.
Chinese patent document CN113590958A discloses a continuous learning method for a sequence recommendation model based on sample playback, which samples a small portion of representative samples according to an item-class balancing policy to generate the memory set; this approach does not consider sample difficulty or sample difference. In summary, existing work has difficulty meeting the continuous learning requirements of text knowledge reasoning models.
To sum up, the problems of the prior art include: samples describing the same knowledge point vary widely in form and are uneven in quality; knowledge-point categories are incompletely covered and the number of samples per category is unbalanced; and when selecting samples to add to the memory set, the selected samples describing the same knowledge point tend to be highly repetitive in form or quality.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a sample selection method for continuous learning of a text knowledge reasoning model.
The invention also discloses a device for realizing the sample selection method.
To address the problem that samples describing the same knowledge point vary in form and are uneven in quality, the invention proposes a representativeness index; to address incomplete coverage of knowledge-point categories and the unbalanced number of samples across knowledge-point categories, a balance index; and to prevent the samples added to the memory set from being highly repetitive in the form or quality with which they describe the same knowledge point, a difference index. On this basis, several selection strategies and a sample selection technique that takes both sample quality and sample feature distribution into account are provided, improving the performance and robustness of continuously learned models for professional-knowledge text reasoning, with theoretical significance for other text understanding tasks as well.
Interpretation of technical terms
1. Expertise: text describing theory, techniques, concepts, or facts in professional fields such as finance, law, and accounting, as distinguished from general knowledge and common sense.
2. Expertise point: the minimum constituent unit of professional knowledge, given in a normalized text description form, hereinafter referred to as a knowledge point.
3. Premise text: a domain knowledge point, or a description of facts relating to a domain knowledge point. The same knowledge point may be described by several different premise texts.
4. Hypothesis text: a textual description of how different people understand a professional knowledge point. One premise text may have several corresponding hypothesis texts.
5. Tasks: in continuous learning the model learns from a series of tasks with a temporal ordering, and model learning is carried out on each task in turn.
6. Data set: each task has its own data set. Each sample of the data set is a triple of the form (premise text, hypothesis text, label), where the label takes the value 0, 1, or 2 to indicate that the sample is entailment, contradiction, or neutral, respectively. The relationship between tasks and data sets is shown in FIG. 1.
7. Sentence-BERT model: the model described in Reimers N., Gurevych I., "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", arXiv preprint arXiv:1908.10084, 2019.
8. Sentence-Transformers: a code implementation of the Sentence-BERT model built on the Python PyTorch framework; the name currently has no Chinese translation.
The detailed technical scheme of the invention is as follows:
a sample selection method for continuous learning of a text knowledge reasoning model is characterized by comprising the following steps: historical task sample selection and current task sample selection;
(1) The historical task sample selection comprises the following steps:
obtaining center vectors and selecting samples, wherein a center vector is the center vector of all samples describing the same knowledge point, and selecting samples means choosing suitable samples to add to the memory sample set;
when the center vector is obtained:
For samples of the form (premise text, hypothesis text, label), the surface-feature center vector and the implicit-feature center vector of each knowledge point are calculated with formulas (I) and (II), where the relevant subset of the task's data set contains all samples that describe the same knowledge point, and two functions respectively extract the surface-layer features and the implicit features of a text. [Formula images (I) and (II) are not reproduced; they give, for each knowledge point, the center of the surface feature vectors and the center of the implicit feature vectors of all samples describing that knowledge point.]
In formulas (I) and (II), the indices of the two center vectors indicate their order, i.e., the i-th surface-feature center vector and the i-th implicit-feature center vector.
Acquiring the surface-layer features and the implicit features of a text: the surface features are expressed using term frequency and inverse document frequency (TF-IDF); the implicit features are expressed using the Sentence-BERT vector of the encoded text.
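As an illustration of how the two center vectors of a knowledge point might be computed, the following minimal sketch assumes NumPy feature matrices and treats each center as the mean of the feature vectors; the function and key names (compute_centers, "knowledge_point") are illustrative and not taken from the patent.

import numpy as np
from collections import defaultdict

def compute_centers(samples, surface_vecs, implicit_vecs):
    """Group samples by knowledge point and average their feature vectors.

    samples       : list of dicts, each with a 'knowledge_point' key
    surface_vecs  : np.ndarray of shape (n_samples, d_surface), e.g. TF-IDF vectors
    implicit_vecs : np.ndarray of shape (n_samples, d_implicit), e.g. Sentence-BERT vectors
    """
    groups = defaultdict(list)
    for idx, s in enumerate(samples):
        groups[s["knowledge_point"]].append(idx)

    surface_centers, implicit_centers = {}, {}
    for kp, idxs in groups.items():
        surface_centers[kp] = surface_vecs[idxs].mean(axis=0)    # formula (I), assumed to be a mean
        implicit_centers[kp] = implicit_vecs[idxs].mean(axis=0)  # formula (II), assumed to be a mean
    return surface_centers, implicit_centers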
when the sample is selected, the process comprises the following steps:
(1-1) determining the number of samples selected for addition to the memory set:
Formulas (III) and (IV) determine how many samples describing the same knowledge point are selected and added to the memory set. Numbering the tasks in order and taking the current task as the t-th task, formula (III) gives the number of samples selected from the data set of each of the historical tasks (the 1st through (t−1)-th tasks), where the total number of samples to be selected for model training is the sum of the memory-set sample size and the number of samples selected from the current task. [Formula image (III) not reproduced.]
Formula (IV) then determines, within the data set of each task, how many samples are selected from the samples relevant to each knowledge point, so that the distribution of the number of selected samples over knowledge points is consistent with the original data set. [Formula image (IV) not reproduced.] Here the data set of a task means all of its samples, and the knowledge-point subset of a data set means the samples in it that describe a given knowledge point.
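To make the quota computation concrete, the sketch below splits the total budget evenly over tasks and then proportionally over knowledge points; because the formula images (III) and (IV) are not reproduced, the even per-task split and the rounding rule are assumptions consistent with the surrounding text.

def select_counts(total_budget, task_datasets, current_task_idx):
    """Split the total sample budget over tasks, then over knowledge points.

    total_budget     : total number of samples used for training (memory set + current task)
    task_datasets    : list where task_datasets[k] maps knowledge_point -> list of samples
    current_task_idx : index t of the current task (0-based)
    """
    n_tasks = current_task_idx + 1
    per_task = total_budget // n_tasks               # assumed even split across tasks (formula III)

    quotas = []
    for k in range(n_tasks):
        dataset = task_datasets[k]
        n_total = sum(len(v) for v in dataset.values())
        # proportional allocation so the per-knowledge-point distribution
        # matches the original data set (formula IV)
        quotas.append({kp: max(1, round(per_task * len(v) / n_total))
                       for kp, v in dataset.items()})
    return per_task, quotas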
(1-2) taking samples:
samples are measured against the representativeness, difference, and balance indexes to select the memory set; during selection, the samples are traversed using one of the following two schemes (a selection-loop sketch is given after this list):
scheme (1): traverse in ascending order of the distance between the sample vector and the center vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly, each with equal probability;
if a traversed sample satisfies representativeness, difference, and balance, it is added to the memory set;
otherwise, the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
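The selection loop itself can be sketched as follows; is_representative, is_different, and keeps_balance stand for the three index checks defined later in the document and are placeholders supplied by the caller, and sorting only by the surface-vector distance is a simplification (the patent traverses by the distance of both the surface and implicit vectors to the center vectors).

import random

def build_memory_subset(candidates, quota, is_representative, is_different, keeps_balance,
                        center_vec=None, by_center_distance=True):
    """Traverse candidate samples and keep those passing all three index checks.

    candidates : list of (sample, surface_vec, implicit_vec) tuples
    quota      : number of samples to select for this knowledge point
    """
    order = list(candidates)
    if by_center_distance and center_vec is not None:
        # scheme (1): ascending distance between sample vector and center vector
        order.sort(key=lambda c: float(((c[1] - center_vec) ** 2).sum()))
    else:
        # scheme (2): random traversal with equal probability
        random.shuffle(order)

    selected = []
    for sample, s_vec, i_vec in order:
        if len(selected) >= quota:
            break
        if (is_representative(s_vec, i_vec) and is_different(s_vec, i_vec, selected)
                and keeps_balance(sample, selected)):
            selected.append((sample, s_vec, i_vec))
        # otherwise the sample is discarded and the traversal continues
    return selected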
(2) The current task sample selection comprises the following steps:
sample representativeness analysis, sample difficulty analysis, and sample sampling; the flow is as follows:
To reduce training cost, only a very small number of samples are expected to be selected from the current task for training; however, a small number of samples can hardly represent the characteristics of the overall sample population, so representative and difficult samples need to be extracted from the current task data set for training the text knowledge reasoning model. Since the screened samples are used to fine-tune the text knowledge reasoning model, on the one hand the representativeness of the overall samples must be preserved as far as possible under the constraint of a limited sample budget, and on the other hand difficult samples must be selected because they carry more information that can benefit the model; the invention therefore combines representativeness and difficulty to screen samples for the current task.
(2-1) sample representativeness analysis in the current-task sample selection, comprising:
For the data set of the current task (the t-th task), the set of hypothesis-text surface feature vectors of the samples in the candidate sample set of each knowledge point is clustered; the number of clusters is specified in advance and can be determined from the number of knowledge points of the previous task, i.e., a value close to that number is chosen. Formula (VI) yields the resulting clusters. Given the number of samples to be extracted for each knowledge point of the current task, for each cluster the in-cluster sample variance and the in-cluster sample count are calculated to analyze sample representativeness: in a cluster with large variance and many samples, each individual sample is less representative of the cluster, so more samples must be drawn from that cluster to preserve its representativeness. Formula (V) accordingly determines the number of samples drawn from each cluster, i.e., in the t-th task, the number of samples selected from the j-th cluster of the i-th knowledge point. [Formula images (V) and (VI) not reproduced.]
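To make step (2-1) concrete, the sketch below clusters the hypothesis-text surface feature vectors and splits the knowledge point's quota across clusters in proportion to (in-cluster variance × cluster size); the choice of k-means and this exact allocation rule are assumptions, since formulas (V) and (VI) are only described in the text, not reproduced.

import numpy as np
from sklearn.cluster import KMeans

def cluster_quotas(hypothesis_vecs, n_clusters, n_to_draw):
    """Cluster hypothesis-text surface feature vectors and split the quota per cluster.

    hypothesis_vecs : np.ndarray of shape (n_samples, d)
    n_clusters      : number of clusters (e.g. close to the previous task's knowledge-point count)
    n_to_draw       : number of samples to extract for this knowledge point
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(hypothesis_vecs)
    labels = km.labels_

    weights = []
    for j in range(n_clusters):
        members = hypothesis_vecs[labels == j]
        var = members.var(axis=0).sum() if len(members) > 1 else 0.0  # in-cluster variance
        # clusters with large variance and many samples get more of the quota
        weights.append(var * len(members))

    total = sum(weights) or 1.0
    quotas = [int(round(n_to_draw * w / total)) for w in weights]  # rounding may shift the total slightly
    return labels, quotas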
(2-2) sample difficulty analysis in the current-task sample selection, comprising:
A pre-trained professional text reasoning model takes the premise text and the hypothesis text of an input sample and performs inference prediction, giving a predicted probability distribution over the category set {0, 1, 2}; let c* denote the category label with the maximum predicted probability. Formula (VII) measures the difficulty of the sample by the difference between the largest and the second-largest output probabilities in the predicted distribution; the smaller this difference, the less confident the reasoning model is in its prediction, and therefore the more difficult the sample. Writing P(c) for the probability with which the professional text reasoning model predicts category c, formula (VII) is:
difficulty = P(c*) − max over c ≠ c* of P(c)    (VII)
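Formula (VII) is the margin between the two largest predicted probabilities; a minimal sketch, assuming the reasoning model exposes a softmax probability vector over the three classes:

import numpy as np

def difficulty(prob_dist):
    """Margin between the largest and second-largest predicted probabilities (formula VII).

    prob_dist : iterable of class probabilities over {entailment, contradiction, neutral}
    A smaller value means the model is less confident, i.e. the sample is harder.
    """
    p = np.sort(np.asarray(prob_dist, dtype=float))[::-1]
    return float(p[0] - p[1])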
(2-3) sample sampling in the current-task sample selection, including:
(2-3-1) for the candidate sample set of each knowledge point in the current task data set, perform the sample representativeness analysis described in step (2-1), obtaining the number of samples to be drawn from each cluster so as to maintain the representativeness of the screened sample set;
(2-3-2) after the sample difficulty analysis described in step (2-2), the quantified difficulty value of each sample is available; for each cluster, sample the most difficult samples from it, i.e., choose the allotted number of samples in ascending order of the difficulty value and add them to the small sample set. When all clusters have been sampled in this way, the sampling of the knowledge point is complete; once all knowledge points in the current task have gone through this sampling process, data screening on the current task is finished, and the training sample set finally screened for the current task has the required number of samples.
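Combining steps (2-1) to (2-3), the per-cluster sampling can be sketched as follows; the helper names are illustrative, and difficulty values are the formula-(VII) margins defined above (smaller means harder).

def sample_current_task(samples, labels, quotas, difficulties):
    """Pick the hardest samples from each cluster (step 2-3).

    samples      : list of samples for one knowledge point
    labels       : cluster label of each sample, aligned with `samples`
    quotas       : number of samples to draw from each cluster (step 2-1)
    difficulties : formula-(VII) margin of each sample (smaller = harder)
    """
    selected = []
    for j, quota in enumerate(quotas):
        members = [i for i, lab in enumerate(labels) if lab == j]
        # ascending margin = hardest samples first
        members.sort(key=lambda i: difficulties[i])
        selected.extend(samples[i] for i in members[:quota])
    return selected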
According to the invention, in the step (1), the specific method for acquiring the surface features of the text is as follows:
the text is segmented by utilizing a coarse granularity word segmentation device of the HanLP, common Chinese stop words are removed from a word segmentation result, and words which only appear once are screened out, so that a dictionary of all the texts is obtained; and finally, calculating TF-IDF characteristic vectors of each text as surface layer characteristics of the sample.
According to the present invention, in step (1), the specific method for acquiring the implicit features of the text is preferably as follows:
A concrete implementation of the Sentence-BERT model, Sentence-Transformers, is used; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the encoded Sentence-BERT feature vector is used to represent the implicit feature of the text.
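The implicit-feature encoding can be reproduced with the sentence-transformers package; a minimal sketch using the model name stated in the text:

from sentence_transformers import SentenceTransformer

def implicit_features(texts):
    """Sentence-BERT vectors as the implicit features of the samples."""
    model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
    return model.encode(texts, convert_to_numpy=True)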
Preferably, in step (1-2), the method of selecting the memory set by measuring samples against the representativeness, difference, and balance indexes comprises the following steps:
For a series of tasks there is a corresponding data set on each task, the current task to be learned has its own corresponding data set, and every sample in a data set carries a label. When the text knowledge reasoning model learns on the current task, memory playback is performed with the memory set, which is the union, taken over all tasks preceding the current task, of the subsets of their data sets that were added to the memory set; this union is written as the memory set for short. The selected set must satisfy representativeness, difference, and balance.
Representativeness is measured using the Local Outlier Factor (LOF): because individual levels of expertise and styles of expression differ, the samples describing the same knowledge point are uneven in quality and diverse in form, so the samples added to the memory set must be able to represent the different qualities and forms of all samples describing the corresponding knowledge point.
Difference: for two samples, the difference between them is measured by the L2 distance of their surface feature vectors and of their implicit feature vectors; to achieve the model-robustness goal, diverse samples are put into the memory set, i.e., there must be differences between the sample features.
Balance means that the memory set covers all knowledge points described in the original data set, and that the number of samples describing the same knowledge point in the original data set is in balance with the number of samples describing it in the memory set.
According to a preferred embodiment of the present invention, the specific method of measuring representativeness with the Local Outlier Factor (LOF) is as follows:
Formula (VIII) calculates the distance between two vectors a and b, which are either two surface feature vectors or two implicit feature vectors:
d(a, b) = ‖a − b‖₂    (VIII)
Formula (IX) gives the k-reachable distance from vector a to vector b:
reach-dist_k(a, b) = max{ d_k(b), d(a, b) }    (IX)
where d_k(b) denotes, in the vector space, the distance between vector b and its k-th nearest vector;
Formula (X) calculates the local reachable density of vector a:
lrd_k(a) = |N_k(a)| / Σ_{b ∈ N_k(a)} reach-dist_k(a, b)    (X)
where N_k(a) is the set of all vectors within the k-distance of vector a, the k-distance of vector a being the distance from a to its k-th nearest vector;
Formula (XI) calculates the local outlier factor of vector a:
LOF_k(a) = ( Σ_{b ∈ N_k(a)} lrd_k(b) / lrd_k(a) ) / |N_k(a)|    (XI)
When the LOF value of a vector is greater than 1, the local reachable density of the current vector is smaller than that of the surrounding vectors, and the larger the value, the more the current sample is an outlier;
when the LOF value of a vector is less than or equal to 1, the local reachable density of the current vector is larger than that of the surrounding vectors, and the smaller the value, the more the current sample lies in a dense aggregation;
The representativeness of a sample is obtained by combining its local outlier condition in the surface feature space and in the implicit feature space: the smaller the LOF values, the larger the representativeness and the more representative the sample. From the above description of LOF values below and above 1, a larger LOF value means a greater degree of outlierness of the corresponding vector and a higher likelihood that the vector is abnormal, while a smaller LOF value means a smaller degree of outlierness and a higher likelihood that the vector is normal;
Formula (XII) gives the representativeness of a sample; it combines the LOF values of the sample's surface feature vector and implicit feature vector using an adjustable parameter α that indicates the relative importance of the surface features: increase α if the surface features are more important, decrease it if the implicit features are more important, the default value being 0.5; the division in formula (XII) keeps the distribution of sample representativeness values within a similar range. [Formula image (XII) not reproduced.]
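Formulas (VIII)–(XI) are the standard Local Outlier Factor definitions, so scikit-learn's implementation can be used; because the exact form of formula (XII) is not reproduced, the sketch below assumes representativeness is the reciprocal of an α-weighted combination of the two LOF values, which matches the stated behaviour (smaller LOF gives larger representativeness, α defaulting to 0.5).

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def representativeness(surface_vecs, implicit_vecs, k=20, alpha=0.5):
    """Per-sample representativeness from LOF in both feature spaces (formulas VIII-XII).

    surface_vecs, implicit_vecs : arrays of shape (n_samples, d)
    k     : number of neighbours used by LOF
    alpha : relative importance of the surface features (default 0.5)
    """
    lof_s = -LocalOutlierFactor(n_neighbors=k).fit(surface_vecs).negative_outlier_factor_
    lof_i = -LocalOutlierFactor(n_neighbors=k).fit(implicit_vecs).negative_outlier_factor_
    # assumed form of formula (XII): smaller LOF values give larger representativeness
    return 1.0 / (alpha * lof_s + (1.0 - alpha) * lof_i)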
According to a preferred embodiment of the present invention, the difference between two samples is measured using the L2 distance of their surface feature vectors and of their implicit feature vectors; the specific method is as follows:
As in formulas (XIII) and (XIV), where the difference threshold is an adjustable parameter defaulting to the average of the distances between all samples, a candidate sample is eligible for the memory set only when it satisfies formulas (XIII) and (XIV), i.e., when both its surface-feature L2 distance and its implicit-feature L2 distance with respect to the samples already in the memory set exceed the threshold; this step determines whether there is a difference between the candidate sample and the samples in the memory set. [Formula images (XIII) and (XIV) not reproduced.]
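A sketch of the difference check of formulas (XIII)–(XIV), assuming the candidate must keep an L2 distance above the threshold from every sample already in the memory set, in both feature spaces; the threshold defaults to the mean pairwise distance as stated in the text.

import numpy as np

def is_different(cand_surface, cand_implicit, memory, threshold):
    """Formulas (XIII)-(XIV): the candidate must differ from every memory-set sample.

    memory    : list of (surface_vec, implicit_vec) pairs already in the memory set
    threshold : difference threshold (default: average distance between all samples)
    """
    for mem_surface, mem_implicit in memory:
        if np.linalg.norm(cand_surface - mem_surface) <= threshold:
            return False
        if np.linalg.norm(cand_implicit - mem_implicit) <= threshold:
            return False
    return True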
According to a preferred embodiment of the present invention, balance means that the feature distribution of the sample set selected into the memory set approximates the feature distribution of the original data set: for every knowledge point in the knowledge point set, the probability distribution of the memory-set samples describing that knowledge point, with its own distribution parameters, approximates the probability distribution of the original-data-set samples describing that knowledge point, with its own distribution parameters. [Formula image not reproduced.] Each probability distribution is accompanied by the specific parameters that determine it, the memory-set distribution and the original-data-set distribution each having their own parameters. After a sample is added, whether this relation still holds is verified; if it holds, the definition of balance is satisfied.
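Because the distribution-comparison formula is only shown as an image, the sketch below approximates the balance check by comparing, for every knowledge point, its share of samples in the memory set with its share in the original data set; the tolerance parameter is an assumption introduced for the sketch.

from collections import Counter

def keeps_balance(memory_samples, original_samples, tolerance=0.05):
    """Approximate balance check: per-knowledge-point proportions should stay close.

    memory_samples, original_samples : lists of dicts with a 'knowledge_point' key
    """
    if not memory_samples:
        return False
    mem_counts = Counter(s["knowledge_point"] for s in memory_samples)
    orig_counts = Counter(s["knowledge_point"] for s in original_samples)
    if set(orig_counts) - set(mem_counts):
        return False  # some knowledge point of the original data set is not covered
    m_total, o_total = sum(mem_counts.values()), sum(orig_counts.values())
    return all(abs(mem_counts[kp] / m_total - orig_counts[kp] / o_total) <= tolerance
               for kp in orig_counts)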
In the above process, the hyperparameters may be left at their default values, or customized by the user to meet actual business requirements. The invention can automatically select historical samples to add to the memory set according to the established algorithm, for subsequent training of the natural language reasoning model oriented to professional knowledge points.
An apparatus for implementing the sample selection method, comprising: the system comprises a center vector calculating module, a sample selecting module and a training module;
the center vector calculating module is used for calculating the center vectors of the surface layer characteristics and the implicit characteristics according to the label information of the samples and used for selecting the subsequent samples;
the sample selection module selects, using a selectable sample selection strategy, suitable samples to add to the memory set according to the characteristics that the samples must satisfy, finally obtaining the complete memory set;
and the training module is used for assisting the current task training by utilizing the complete memory set.
A computer device for implementing the sample selection method, characterized in that it comprises a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following:
in the use stage, a user specifies the tasks, the precedence relationship among the tasks, the strategies, and the optional hyperparameters; historical samples are then selected by the sample selection method and added to the memory set, and the memory set is used for subsequent training of the professional text reasoning model, i.e., training of the natural language reasoning model oriented to professional knowledge points.
The invention has the technical advantages that:
compared with the prior art that the representative sample is selected based on the clustering center, the continuous learning sample selection method for the text knowledge reasoning model can be better adapted to complex text reasoning scenes, and a small number of samples are effectively used for approximating the distribution of the original samples, so that the model memorizes the learned knowledge on the historical task.
According to the invention, the text knowledge reasoning model can be finely tuned and optimized according to the properties and the screened samples of the high-quality memory set sample provided by the practical problem, the model can be better helped to memorize the existing task, and meanwhile, the model is helped to train the current task, so that the robustness of the model in practical use is effectively increased.
In addition, the invention can also be used on other similar tasks in the field of natural language processing, such as continuous learning tasks based on scene memory playback, such as knowledge questions and answers, text classification and the like.
Drawings
FIG. 1 is a schematic diagram of task versus dataset for text-oriented knowledge reasoning in accordance with the present invention;
FIG. 2 is a flow chart of a method for selecting a continuous learning history sample based on text knowledge reasoning;
FIG. 3 is a schematic diagram of a sample selection flow of a text knowledge reasoning-oriented continuous learning history sample selection method in accordance with the present invention;
fig. 4 is a flow chart of a method for continuously learning a current sample selection, which is directed to text knowledge reasoning in the present invention.
Detailed Description
The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.
Example 1,
As shown in fig. 1 and fig. 2, a sample selection method for continuous learning of a text knowledge inference model includes: historical task sample selection and current task sample selection;
(1) The historical task sample selection comprises the following steps:
obtaining center vectors and selecting samples, as shown in FIG. 2; a center vector is the center vector of all samples describing the same knowledge point, and selecting samples means choosing suitable samples to add to the memory sample set;
when the center vector is obtained:
For samples of the form (premise text, hypothesis text, label), the surface-feature center vector and the implicit-feature center vector of each knowledge point are calculated with formulas (I) and (II), where the relevant subset of the task's data set contains all samples that describe the same knowledge point, and two functions respectively extract the surface-layer features and the implicit features of a text. [Formula images (I) and (II) are not reproduced; they give, for each knowledge point, the center of the surface feature vectors and the center of the implicit feature vectors of all samples describing that knowledge point.]
In formulas (I) and (II), the indices of the two center vectors indicate their order, i.e., the i-th surface-feature center vector and the i-th implicit-feature center vector.
Acquiring the surface-layer features and the implicit features of a text: the surface features are expressed using term frequency and inverse document frequency (TF-IDF); the implicit features are expressed using the Sentence-BERT vector of the encoded text.
when the sample is selected, the flow is as shown in fig. 3, and includes:
(1-1) determining the number of samples selected to be added to the memory set:
Formulas (III) and (IV) determine how many samples describing the same knowledge point are selected and added to the memory set. Numbering the tasks in order and taking the current task as the t-th task, formula (III) gives the number of samples selected from the data set of each of the historical tasks (the 1st through (t−1)-th tasks), where the total number of samples to be selected for model training is the sum of the memory-set sample size and the number of samples selected from the current task. [Formula image (III) not reproduced.]
Formula (IV) then determines, within the data set of each task, how many samples are selected from the samples relevant to each knowledge point, so that the distribution of the number of selected samples over knowledge points is consistent with the original data set. [Formula image (IV) not reproduced.] Here the data set of a task means all of its samples, and the knowledge-point subset of a data set means the samples in it that describe a given knowledge point.
(1-2) taking samples:
samples are measured against the representativeness, difference, and balance indexes to select the memory set; during selection, the samples are traversed using one of the following two schemes:
scheme (1): traverse in ascending order of the distance between the sample vector and the center vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly, each with equal probability;
if a traversed sample satisfies representativeness, difference, and balance, it is added to the memory set;
otherwise, the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
(2) The current task sample selection comprises the following steps:
sample representativeness analysis, sample difficulty analysis, and sample sampling; the flow is shown in FIG. 4;
To reduce training cost, only a very small number of samples are expected to be selected from the current task for training; however, a small number of samples can hardly represent the characteristics of the overall sample population, so representative and difficult samples need to be extracted from the current task data set for training the text knowledge reasoning model. Since the screened samples are used to fine-tune the text knowledge reasoning model, on the one hand the representativeness of the overall samples must be preserved as far as possible under the constraint of a limited sample budget, and on the other hand difficult samples must be selected because they carry more information that can benefit the model; the invention therefore combines representativeness and difficulty to screen samples for the current task.
(2-1) sample representativeness analysis in the current-task sample selection, comprising:
For the data set of the current task (the t-th task), the set of hypothesis-text surface feature vectors of the samples in the candidate sample set of each knowledge point is clustered; the number of clusters is specified in advance and can be determined from the number of knowledge points of the previous task, i.e., a value close to that number is chosen. Formula (VI) yields the resulting clusters. Given the number of samples to be extracted for each knowledge point of the current task, for each cluster the in-cluster sample variance and the in-cluster sample count are calculated to analyze sample representativeness: in a cluster with large variance and many samples, each individual sample is less representative of the cluster, so more samples must be drawn from that cluster to preserve its representativeness. Formula (V) accordingly determines the number of samples drawn from each cluster, i.e., in the t-th task, the number of samples selected from the j-th cluster of the i-th knowledge point. [Formula images (V) and (VI) not reproduced.]
(2-2) sample difficulty analysis in the current-task sample selection, comprising:
A pre-trained professional text reasoning model takes the premise text and the hypothesis text of an input sample and performs inference prediction, giving a predicted probability distribution over the category set {0, 1, 2}; let c* denote the category label with the maximum predicted probability. Formula (VII) measures the difficulty of the sample by the difference between the largest and the second-largest output probabilities in the predicted distribution; the smaller this difference, the less confident the reasoning model is in its prediction, and therefore the more difficult the sample. Writing P(c) for the probability with which the professional text reasoning model predicts category c, formula (VII) is:
difficulty = P(c*) − max over c ≠ c* of P(c)    (VII)
(2-3) sample sampling in the current-task sample selection, including:
(2-3-1) for the candidate sample set of each knowledge point in the current task data set, perform the sample representativeness analysis described in step (2-1), obtaining the number of samples to be drawn from each cluster so as to maintain the representativeness of the screened sample set;
(2-3-2) after the sample difficulty analysis described in step (2-2), the quantified difficulty value of each sample is available; for each cluster, sample the most difficult samples from it, i.e., choose the allotted number of samples in ascending order of the difficulty value and add them to the small sample set. When all clusters have been sampled in this way, the sampling of the knowledge point is complete; once all knowledge points in the current task have gone through this sampling process, data screening on the current task is finished, and the training sample set finally screened for the current task has the required number of samples.
EXAMPLE 2,
Based on the sample selection method for continuous learning of a text knowledge reasoning model of Embodiment 1, in step (1) the specific method for acquiring the surface features of the text is as follows:
The text is segmented with the HanLP coarse-grained tokenizer; common Chinese stop words are removed from the segmentation result, and words that appear only once are filtered out, which yields a dictionary over all the texts; finally, the TF-IDF feature vector of each text is computed as the surface-layer feature of the sample.
In step (1), the specific method for acquiring the implicit features of the text is as follows:
A concrete implementation of the Sentence-BERT model, Sentence-Transformers, is used; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the encoded Sentence-BERT feature vector is used to represent the implicit feature of the text.
EXAMPLE 3,
Based on the sample selection method for continuous learning of a text knowledge reasoning model of Embodiments 1 and 2, the method in step (1-2) of selecting the memory set by measuring samples against the representativeness, difference, and balance indexes comprises the following steps:
For a series of tasks there is a corresponding data set on each task, the current task to be learned has its own corresponding data set, and every sample in a data set carries a label. When the text knowledge reasoning model learns on the current task, memory playback is performed with the memory set, which is the union, taken over all tasks preceding the current task, of the subsets of their data sets that were added to the memory set; this union is written as the memory set for short. The selected set must satisfy representativeness, difference, and balance.
Representativeness is measured using the Local Outlier Factor (LOF): because individual levels of expertise and styles of expression differ, the samples describing the same knowledge point are uneven in quality and diverse in form, so the samples added to the memory set must be able to represent the different qualities and forms of all samples describing the corresponding knowledge point.
Difference: for two samples, the difference between them is measured by the L2 distance of their surface feature vectors and of their implicit feature vectors; to achieve the model-robustness goal, diverse samples are put into the memory set, i.e., there must be differences between the sample features.
Balance means that the memory set covers all knowledge points described in the original data set, and that the number of samples describing the same knowledge point in the original data set is in balance with the number of samples describing it in the memory set.
The specific method of measuring representativeness with the Local Outlier Factor (LOF) is as follows:
Formula (VIII) calculates the distance between two vectors a and b, which are either two surface feature vectors or two implicit feature vectors:
d(a, b) = ‖a − b‖₂    (VIII)
Formula (IX) gives the k-reachable distance from vector a to vector b:
reach-dist_k(a, b) = max{ d_k(b), d(a, b) }    (IX)
where d_k(b) denotes, in the vector space, the distance between vector b and its k-th nearest vector;
Formula (X) calculates the local reachable density of vector a:
lrd_k(a) = |N_k(a)| / Σ_{b ∈ N_k(a)} reach-dist_k(a, b)    (X)
where N_k(a) is the set of all vectors within the k-distance of vector a, the k-distance of vector a being the distance from a to its k-th nearest vector;
Formula (XI) calculates the local outlier factor of vector a:
LOF_k(a) = ( Σ_{b ∈ N_k(a)} lrd_k(b) / lrd_k(a) ) / |N_k(a)|    (XI)
When the LOF value of a vector is greater than 1, the local reachable density of the current vector is smaller than that of the surrounding vectors, and the larger the value, the more the current sample is an outlier;
when the LOF value of a vector is less than or equal to 1, the local reachable density of the current vector is larger than that of the surrounding vectors, and the smaller the value, the more the current sample lies in a dense aggregation;
The representativeness of a sample is obtained by combining its local outlier condition in the surface feature space and in the implicit feature space: the smaller the LOF values, the larger the representativeness and the more representative the sample. From the above description of LOF values below and above 1, a larger LOF value means a greater degree of outlierness of the corresponding vector and a higher likelihood that the vector is abnormal, while a smaller LOF value means a smaller degree of outlierness and a higher likelihood that the vector is normal;
Formula (XII) gives the representativeness of a sample; it combines the LOF values of the sample's surface feature vector and implicit feature vector using an adjustable parameter α that indicates the relative importance of the surface features: increase α if the surface features are more important, decrease it if the implicit features are more important, the default value being 0.5; the division in formula (XII) keeps the distribution of sample representativeness values within a similar range. [Formula image (XII) not reproduced.]
The difference between two samples is measured using the L2 distance of their surface feature vectors and of their implicit feature vectors; the specific method is as follows:
As in formulas (XIII) and (XIV), where the difference threshold is an adjustable parameter defaulting to the average of the distances between all samples, a candidate sample is eligible for the memory set only when it satisfies formulas (XIII) and (XIV), i.e., when both its surface-feature L2 distance and its implicit-feature L2 distance with respect to the samples already in the memory set exceed the threshold; this step determines whether there is a difference between the candidate sample and the samples in the memory set. [Formula images (XIII) and (XIV) not reproduced.]
The balance: the feature distribution of the sample set selected into the memory set approximates the feature distribution of the original data set, i.e., for every knowledge point in the knowledge point set, the probability distribution of the memory-set samples describing that knowledge point, with its own distribution parameters, approximates the probability distribution of the original-data-set samples describing that knowledge point, with its own distribution parameters. [Formula image not reproduced.] Each probability distribution is accompanied by the specific parameters that determine it, the memory-set distribution and the original-data-set distribution each having their own parameters. After a sample is added, whether this relation still holds is verified; if it holds, the definition of balance is satisfied.
In the above process, the hyperparameters may be left at their default values, or customized by the user to meet actual business requirements. The invention can automatically select historical samples to add to the memory set according to the established algorithm, for subsequent training of the natural language reasoning model oriented to professional knowledge points.
EXAMPLE 5,
An apparatus for implementing the sample selection method of embodiments 1-4, comprising: the system comprises a center vector calculating module, a sample selecting module and a training module;
the center vector calculating module is used for calculating the center vectors of the surface layer characteristics and the implicit characteristics according to the label information of the samples and used for selecting the subsequent samples;
the sample selection module selects, using a selectable sample selection strategy, suitable samples to add to the memory set according to the characteristics that the samples must satisfy, finally obtaining the complete memory set;
and the training module is used for assisting the current task training by utilizing the complete memory set.
EXAMPLE 6,
A computer apparatus implementing the sample selection method of Embodiments 1-4, comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following:
in the use stage, a user specifies the tasks, the precedence relationship among the tasks, the strategies, and the optional hyperparameters; historical samples are then selected by the sample selection method and added to the memory set, and the memory set is used for subsequent training of the professional text reasoning model, i.e., training of the natural language reasoning model oriented to professional knowledge points.

Claims (8)

1. A sample selection method for continuous learning of a text knowledge reasoning model is characterized by comprising the following steps: historical task sample selection and current task sample selection;
(1) The historical task sample selection comprises the following steps: obtaining a center vector and selecting a sample;
when the center vector is obtained: for samples of the form (premise text, hypothesis text, label), the surface-feature center vector and the implicit-feature center vector of each knowledge point are calculated with formulas (I) and (II), where the relevant subset of the task's data set contains all samples that describe the same knowledge point, and two functions respectively extract the surface-layer features and the implicit features of a text; [formula images (I) and (II) not reproduced]
in formulas (I) and (II), the indices of the two center vectors indicate their order, i.e., the i-th surface-feature center vector and the i-th implicit-feature center vector;
acquiring the surface-layer features and the implicit features of a text: the surface features are expressed using term frequency and inverse document frequency (TF-IDF); the implicit features are expressed using the Sentence-BERT vector of the encoded text;
the sample selecting process includes:
(1-1) determining the number of samples selected to be added to the memory set:
(1-2) selecting samples, namely selecting the memory set by measuring samples against the representativeness, difference, and balance indexes, and traversing the samples with one of the following two schemes when selecting samples,
scheme (1): traverse in ascending order of the distance between the sample vector and the center vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly, each with equal probability;
if a traversed sample satisfies representativeness, difference, and balance, it is added to the memory set;
otherwise, the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
(2) The current task sample selection comprises the following steps: sample representativeness analysis, sample difficulty analysis, and sample sampling;
(2-1) sample representativeness analysis in the current-task sample selection;
(2-2) sample difficulty analysis in the current-task sample selection;
(2-3) sample sampling in the current-task sample selection;
in step (1):
the (1-1) determines the number of samples selected to be added into the memory set:
recording the current task as the first according to the sequence of the tasks
Figure QLYQS_18
Task, history- >
Figure QLYQS_19
The sample size selected in the data set of the individual tasks is +.>
Figure QLYQS_20
As formula (III), wherein ∈>
Figure QLYQS_21
The total sample size required to be selected for model training, namely the sum of the memory set sample size and the sample size selected by the current task: />
Figure QLYQS_22
Determining the ith from the data set of the ith task by formula (IV)
Figure QLYQS_23
The number of samples selected from the relevant samples of the knowledge points is +.>
Figure QLYQS_24
Such that the distribution of the number of samples of each knowledge point extracted is consistent with the original dataset:
Figure QLYQS_25
wherein
Figure QLYQS_26
Indicate->
Figure QLYQS_27
A plurality of data sets; />
Figure QLYQS_28
Indicate->
Figure QLYQS_29
Description of the data set->
Figure QLYQS_30
A dataset of individual knowledge points;
in step (2), the sample representativeness analysis (2-1) in the current-task sample selection includes:
for the current task data set, the set of hypothesis-text surface feature vectors of the samples in the candidate sample set of each knowledge point is clustered, with the number of clusters specified; formula (VI) yields the resulting clusters; given the number of samples to be extracted for each knowledge point of the current task, for each cluster the in-cluster sample variance and the in-cluster sample count are calculated to analyze sample representativeness; formula (V) determines the number of samples drawn from each cluster, i.e., in the t-th task, the number of samples selected from the j-th cluster of the i-th knowledge point; [formula images (V) and (VI) not reproduced]
(2-2) sample difficulty analysis in sample selection for the current task, comprising:
the premise text p and the hypothesis text h of a sample x are input into the pre-trained professional text reasoning model F for inference prediction, obtaining the predictive probability distribution P over the category set Y, where y* denotes the class label with the maximum probability; formula (VII) measures the difficulty d_x of sample x as the difference between the maximum output probability and the second-largest output probability in the inference model's predicted probability distribution for the sample, wherein P(y*|x) denotes the probability with which the professional text reasoning model F predicts the sample category y*, and P(c|x) denotes the probability with which the professional text reasoning model predicts the sample category c;
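The image for formula (VII) is not reproduced. Reading the description literally (difficulty is the margin between the two largest predicted probabilities, with a smaller margin indicating a harder sample), a plausible form is:

```latex
% Hypothetical reconstruction of formula (VII): prediction margin as difficulty
d_x = P(y^{*} \mid x) \;-\; \max_{c \in Y \setminus \{y^{*}\}} P(c \mid x)
\tag{VII}
```

This reading is consistent with step (2-3-2), which selects samples in ascending order of d_x, i.e. smallest margins first.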
(2-3) sample sampling in the sample selection of the current task, including:
(2-3-1) for the candidate sample set of each knowledge point of the current task dataset D_t, performing the sample representativeness analysis described in step (2-1) to obtain the number of samples m_{t,j,k} to be drawn from each cluster C_k, so as to maintain the representativeness of the screened sample set;
(2-3-2) after the sample difficulty analysis described in step (2-2) has produced the quantified difficulty value d_x of each sample x, the m_{t,j,k} most difficult samples are drawn from each cluster C_k, i.e. samples are selected in ascending order of d_x until m_{t,j,k} samples have been added to the small-sample set; performing the above sampling process for all clusters completes the sampling for knowledge point j, and completing it for all knowledge points in the current task finishes the data screening on the current task; the number of training samples finally screened for the current task is m_t;
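As an illustration of steps (2-1) through (2-3), the sketch below clusters the hypothesis-text surface vectors of one knowledge point, allocates the budget across clusters, and keeps the lowest-margin (hardest) samples per cluster. All function and variable names are hypothetical, and the per-cluster allocation follows the assumed proportional rule above rather than the patent's exact formula (V).

```python
import numpy as np
from sklearn.cluster import KMeans

def select_current_task_samples(surface_vecs, margins, budget, n_clusters):
    """Pick `budget` samples for one knowledge point.

    surface_vecs : (n, d) dense TF-IDF vectors of the hypothesis texts
    margins      : (n,) difficulty margins d_x (smaller = harder)
    budget       : total number of samples m_{t,j} to keep
    n_clusters   : number of k-means clusters K
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(surface_vecs)

    # Allocate the budget to clusters in proportion to size weighted by variance.
    weights = []
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        var = surface_vecs[idx].var() if len(idx) > 1 else 0.0
        weights.append(len(idx) * var)
    weights = np.array(weights)
    weights = weights / weights.sum() if weights.sum() > 0 else np.ones(n_clusters) / n_clusters
    quotas = np.ceil(weights * budget).astype(int)

    # Within each cluster, keep the hardest samples (smallest margin first).
    chosen = []
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        hardest = idx[np.argsort(margins[idx])][: quotas[k]]
        chosen.extend(hardest.tolist())
    return chosen[:budget]
```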
In step (1-2), a method for selecting a memory set from a representative, differential, balance index measurement sample, comprising:
the representativeness is measured using a local outlier factor;
the difference, for the sample
Figure QLYQS_79
Difference between using surface layer feature vector and implicit feature vector +.>
Figure QLYQS_80
Measuring the distance;
the balance means that all knowledge points described in the original data set are covered in the memory set, and the number of samples describing the same knowledge point in the original data set is balanced with the number of samples describing the same knowledge point in the memory set.
2. The sample selection method for text knowledge reasoning model continuous learning according to claim 1, wherein in step (1), the specific method for obtaining the surface-layer features of the text is as follows:
the text is segmented using the coarse-grained tokenizer of HanLP, common Chinese stop words are removed from the segmentation result, and words that appear only once are filtered out, yielding a dictionary over all the texts; finally, the TF-IDF feature vector of each text is computed as the surface-layer feature of the sample.
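A minimal sketch of this surface-feature pipeline is shown below. The `hanlp_coarse_tokenize` callable and the stop-word list are assumed to be supplied by the user (standing in for HanLP's coarse-grained tokenizer); only the scikit-learn TF-IDF part reflects a concrete, known API.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_surface_features(texts, hanlp_coarse_tokenize, stopwords):
    """TF-IDF surface-layer features over tokenized Chinese text.

    hanlp_coarse_tokenize : callable mapping a string to a list of tokens
                            (assumed to wrap HanLP's coarse-grained tokenizer)
    stopwords             : set of common Chinese stop words to drop
    """
    vectorizer = TfidfVectorizer(
        tokenizer=lambda s: [w for w in hanlp_coarse_tokenize(s) if w not in stopwords],
        token_pattern=None,  # disable the default regex tokenizer
        min_df=2,            # drop words that occur in only one text
        lowercase=False,
    )
    return vectorizer.fit_transform(texts), vectorizer
```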
3. The sample selection method for text knowledge reasoning model continuous learning according to claim 1, wherein in step (1), the specific method for obtaining the implicit features of the text is as follows:
a concrete implementation of the Sentence-BERT model is used: the Sentence-Transformers library loads the pre-trained model paraphrase-multilingual-mpnet-base-v2 to encode the text, and the resulting Sentence-BERT feature vector represents the implicit feature of the text.
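A minimal sketch of this encoding step, assuming the publicly available sentence-transformers package and the paraphrase-multilingual-mpnet-base-v2 checkpoint named in the claim:

```python
from sentence_transformers import SentenceTransformer

def build_implicit_features(texts):
    """Sentence-BERT implicit features for a list of texts."""
    model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
    # One dense vector per text; these serve as the implicit feature vectors.
    return model.encode(texts, convert_to_numpy=True)
```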
4. The sample selection method for text knowledge reasoning model continuous learning according to claim 1, wherein the representativeness is measured using local outlier factors by the following specific method:
equation (VIII) is used to calculate the distance d(p, o) between two vectors p and o, where p and o are two surface-layer feature vectors or two implicit feature vectors;
equation (IX) is used to derive the reachability distance from vector p to vector o, wherein d_k(o) denotes, in the vector space, the distance between vector o and its k-th nearest vector;
equation (X) is used to calculate the local reachable density of vector p, wherein N_k(p) is the set of all vectors within the k-distance of vector p, and the k-distance of vector p is the distance from vector p to its k-th nearest vector;
equation (XI) is used to calculate the local outlier factor LOF of vector p;
when the LOF value of a vector is greater than 1, the local reachable density of the current vector is smaller than the local reachable density of the surrounding vectors, and the larger the value, the more outlying the current sample;
when the LOF value of a vector is less than or equal to 1, the local reachable density of the current vector is larger than the local reachable density of the surrounding vectors, and the smaller the value, the more aggregated the current sample;
the representativeness Rep_x of a sample x is obtained by integrating its local outlier factors in the surface-layer feature space and the implicit feature space; the smaller the LOF values, the larger Rep_x;
formula (XII) gives the calculation of the representativeness Rep_x of sample x, wherein α is an adjustable parameter representing the relative importance of the surface-layer features.
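The equation images (VIII)–(XII) are not reproduced. Equations (VIII)–(XI) as described match the standard local-outlier-factor construction, and (XII) plausibly combines the two LOF values with the weight α; a hedged reconstruction follows (the exact granted forms may differ, and the inverse-LOF combination in (XII) in particular is an assumption):

```latex
% Hypothetical reconstruction of equations (VIII)-(XII)
d(p,o) = \lVert p - o \rVert_2
\tag{VIII}

\mathrm{reach\text{-}dist}_k(p,o) = \max\{\, d_k(o),\; d(p,o) \,\}
\tag{IX}

\mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p,o) \right)^{-1}
\tag{X}

\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}
\tag{XI}

\mathrm{Rep}_x = \frac{\alpha}{\mathrm{LOF}_k\!\left(x_{\mathrm{tfidf}}\right)} + \frac{1-\alpha}{\mathrm{LOF}_k\!\left(x_{\mathrm{sbert}}\right)}
\tag{XII}
```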
5. The sample selection method for text knowledge reasoning model continuous learning according to claim 1, wherein the difference, for two samples, is measured as the L2 distance between their surface-layer feature vectors and between their implicit feature vectors, by the following specific method:
as shown in formulas (XIII) and (XIV), wherein δ is the difference threshold;
a candidate sample that satisfies formulas (XIII) and (XIV) is admitted into the memory set.
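The images for formulas (XIII) and (XIV) are not reproduced. A plausible reading, assuming a candidate sample x must lie at least δ away (in L2) from every sample x' already in the memory set B in both feature spaces, is:

```latex
% Hypothetical reconstruction of formulas (XIII) and (XIV)
\min_{x' \in B} \lVert v_{\mathrm{tfidf}}(x) - v_{\mathrm{tfidf}}(x') \rVert_2 > \delta
\tag{XIII}

\min_{x' \in B} \lVert v_{\mathrm{sbert}}(x) - v_{\mathrm{sbert}}(x') \rVert_2 > \delta
\tag{XIV}
```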
6. The sample selection method for text knowledge reasoning model continuous learning according to claim 1, wherein the balance means that the feature distribution of the set of samples added to the memory set B approximates the feature distribution of the original data set D, as expressed by the following formula:
wherein P_{θ_B} denotes the memory-set sample probability distribution with parameter θ_B, and the subscripted symbol should be read as a whole; P_{θ_D} denotes the probability distribution of the original data set samples; θ_D and θ_B are respectively the parameters of the original data set sample probability distribution and the memory-set sample probability distribution; a probability distribution P is accompanied by a specific parameter θ determining the distribution and is therefore written as P_θ; ∈ is the elementary membership operation of set operations; x denotes a sample; x_k denotes a sample describing knowledge point k; K is the set of knowledge points.
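The formula image in claim 6 is not reproduced. Given the description (every knowledge point of the original data set must be covered and the per-knowledge-point sample distributions should match), a plausible form is:

```latex
% Hypothetical reconstruction of the balance condition in claim 6
\forall k \in K:\quad P_{\theta_B}(x_k) \approx P_{\theta_D}(x_k), \qquad x_k \in B
```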
7. An apparatus for implementing the sample selection method of any one of claims 1-6, comprising a center vector calculation module, a sample selection module and a training module;
the center vector calculation module is used for calculating the center vectors of the surface-layer features and of the implicit features according to the label information of the samples;
the sample selection module selects suitable samples and adds them into the memory set, finally obtaining a complete memory set;
the training module is used for assisting the training of the current task by utilizing the complete memory set.
8. A computer device implementing the sample selection method of any of claims 1-6, characterized by: comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; the processor, when executing the computer program, implements the following:
in the usage stage, a user specifies the tasks, the precedence relationship among the tasks, and the strategies and optional hyper-parameters; historical samples are then selected by the sample selection method and added into the memory set, and the memory set is used for subsequent professional text reasoning model training, namely training the natural language inference model on professional knowledge points.
CN202310107542.8A 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning Active CN115829036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310107542.8A CN115829036B (en) 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning

Publications (2)

Publication Number Publication Date
CN115829036A CN115829036A (en) 2023-03-21
CN115829036B true CN115829036B (en) 2023-05-05

Family

ID=85521149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310107542.8A Active CN115829036B (en) 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning

Country Status (1)

Country Link
CN (1) CN115829036B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021022572A1 (en) * 2019-08-07 2021-02-11 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN112966115A (en) * 2021-05-18 2021-06-15 东南大学 Active learning event extraction method based on memory loss prediction and delay training
WO2021184311A1 (en) * 2020-03-19 2021-09-23 中山大学 Method and apparatus for automatically generating inference questions and answers

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
US9477654B2 (en) * 2014-04-01 2016-10-25 Microsoft Corporation Convolutional latent semantic models and their applications
CN112347268B (en) * 2020-11-06 2024-03-19 华中科技大学 Text-enhanced knowledge-graph combined representation learning method and device
CN115563315A (en) * 2022-12-01 2023-01-03 东南大学 Active complex relation extraction method for continuous few-sample learning
CN115618045B (en) * 2022-12-16 2023-03-14 华南理工大学 Visual question answering method, device and storage medium

Also Published As

Publication number Publication date
CN115829036A (en) 2023-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant