CN115829036A - Sample selection method and device for continuous learning of text knowledge inference model - Google Patents

Sample selection method and device for continuous learning of text knowledge inference model

Info

Publication number
CN115829036A
CN115829036A (application CN202310107542.8A)
Authority
CN
China
Prior art keywords
sample
samples
vector
text
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310107542.8A
Other languages
Chinese (zh)
Other versions
CN115829036B (en)
Inventor
孙宇清
杨磊稳
马磊
杨涛
袁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310107542.8A priority Critical patent/CN115829036B/en
Publication of CN115829036A publication Critical patent/CN115829036A/en
Application granted granted Critical
Publication of CN115829036B publication Critical patent/CN115829036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A sample selection method and device for continuous learning of a text knowledge inference model belong to the technical field of natural language inference and comprise historical task sample selection and current task sample selection. The historical task sample selection comprises: determining the number of samples to be added to the memory set; and selecting samples by measuring them with representativeness, difference and balance indexes, traversing the samples with one of two schemes during selection. The current task sample selection comprises: sample representativeness analysis, sample difficulty analysis and sample sampling. The method takes into account sample properties such as representativeness, balance and difference; compared with prior-art methods that select representative samples based on cluster centres, it adapts better to complex text reasoning scenarios, effectively uses a small number of samples to approximate the distribution of the original samples, and enables the model to retain the knowledge learned on historical tasks.

Description

Sample selection method and device for continuous learning of text knowledge inference model
Technical Field
The invention relates to a sample selection method and a sample selection device for continuous learning of a text knowledge inference model, and belongs to the technical field of natural language inference.
Background
The natural language inference task consists of giving a premise text and a hypothesis text and, taking the premise text as the standard, judging whether the hypothesis text is correct, incorrect or unrelated. Text knowledge reasoning is a special form of the natural language inference task, in which the premise text is a knowledge point in a professional field or a description of facts related to that knowledge point, and the hypothesis text describes how different people understand the knowledge point in the premise text. For example, in an economic-law examination, the premise text is professional knowledge in the field of economic law or a description of related facts, such as the reference answer to an examination question: "According to the provisions of the company legal system, where the natural-person shareholders of a limited liability company change because of inheritance and the other shareholders claim to exercise a right of first refusal, the people's court does not support the claim, unless the articles of association of the company provide otherwise or all shareholders have agreed otherwise." The hypothesis text is the result of different people's understanding of the knowledge point and, in the above case, corresponds to an examinee's answer, such as: "A claim to exercise the right of first refusal is not supported by the people's court. Where a shareholder of a limited liability company dies, the equity is inherited by the successor." In this example, the text knowledge reasoning task is to judge whether the hypothesis text, i.e. the examinee's answer, is correct according to the premise text, i.e. the reference answer. Text knowledge reasoning has important application value in fields such as subjective-question grading, professional knowledge question answering and knowledge reasoning.
In the professional-knowledge text reasoning problem, the number of knowledge-point categories in a professional field is huge, the knowledge points are described in many different forms, the content and form of the premise texts are continuously updated, and the hypothesis texts, being closely tied to an individual's level of professional knowledge and expressive ability, are uneven in quality and diverse in form. These properties of the premise and hypothesis texts make samples describing the same knowledge point highly confusable and hard to identify, leave rarely used knowledge points with few corresponding samples, and leave unpopular professional knowledge points short of labelled samples. Facing continuously growing knowledge sample data, especially knowledge points not covered by the historical sample data, an intelligent model must not only cope with sample-level challenges such as few-shot data and noise, but also with the continual-learning challenge of learning new knowledge points without forgetting existing knowledge, so as to increase its generalisation ability and robustness.
Continual learning is introduced so that the text knowledge reasoning model can complete new problems well while still handling historical tasks with good performance. In the field of artificial intelligence, the memory replay strategy is the most effective continual-learning method; see, for example, Wang, Hong, et al. "Sentence Embedding Alignment for Lifelong Relation Extraction." arXiv preprint arXiv:1903.02588 (2019).
The goal of continual learning is achieved by storing part of the samples of previous tasks and letting them participate in the next round of training; the set formed by these samples is called the memory set, and the quality of the samples in the memory set determines the performance of the inference model on historical tasks.
For example, Chinese patent document CN114722892A provides a continual-learning method and device based on machine learning, which trains a generator on historical data and uses the generator to produce a pseudo-sample set of the corresponding task as the memory set; this method has difficulty guaranteeing the quality of the generated samples, which affects the continual-learning effect.
Chinese patent document CN113688882A proposes a training method and device for a memory-enhanced continual-learning neural network model which, inspired by memory replay in the human brain, uses a simple data-replay method to build an expandable memory module by storing the mean and variance of the data, achieving a memory-enhancement effect on the original task.
Chinese patent document CN113590958A discloses a continual-learning method for a sequence recommendation model based on sample replay, which samples a small portion of representative samples according to an item-class balancing policy to generate the memory set; this method does not consider the difficulty of and the differences between samples. In conclusion, existing work hardly meets the continual-learning requirements of a text knowledge reasoning model.
In summary, the problems of the prior art include: samples describing the same knowledge point are diverse in form and uneven in quality; samples are unbalanced in knowledge-point category coverage and in the number per knowledge-point category; and when selecting samples to add to the memory set, samples describing the same knowledge point are highly repetitive in form or quality.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a sample selection method for continuous learning of a text knowledge reasoning model.
The invention also discloses a device for implementing the sample selection method.
To address these problems, the invention proposes a representativeness index for the problem that samples describing the same knowledge point are diverse in form and uneven in quality; a balance index for the problem that samples are unbalanced in knowledge-point category coverage and in number per knowledge-point category; and a difference index to prevent samples describing the same knowledge point from being highly repetitive in form or quality when selecting samples to add to the memory set. On this basis, the invention further provides several selection strategies and sample-selection techniques that take both sample quality and sample feature distribution into account, improving the model performance and robustness of continual learning for professional-knowledge text reasoning, and having theoretical significance for other text understanding tasks.
Interpretation of professional terms
1. Professional knowledge: texts such as theories, techniques, concepts and fact descriptions in professional fields such as finance, law and accounting, as distinguished from general knowledge and common sense.
2. Professional knowledge point: the minimum unit of professional knowledge, described in a normalized text form, hereinafter referred to as a knowledge point and denoted k.
3. Premise (precondition) text: a professional knowledge point, or a description of facts related to a professional knowledge point; the same knowledge point may be described by several premise texts, denoted p.
4. Hypothesis text: a textual description of the result of a person's understanding of a knowledge point in the professional field; one premise text may have several corresponding hypothesis texts, denoted h.
5. Task: in the continual learning process of the model, learning proceeds over a series of tasks; the tasks have a temporal order, and the model learns on each task separately.
6. Data set: each task has its own data set; each sample x of a data set is a tuple of the form (p, h, y), where p is the premise text, h is the hypothesis text, and y ∈ {0, 1, 2}, with 0, 1 and 2 respectively indicating that the sample label is entailment, contradiction or neutral. The relationship of tasks to data sets is shown in FIG. 1.
7. The Sentence-BERT model: the model described in Reimers N., Gurevych I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084, 2019.
8. SentenceTransformers: a code implementation of the Sentence-BERT model based on the PyTorch framework for Python; there is currently no established Chinese translation of the name.
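For readers following the implementation, the terminology above maps onto a simple data structure. The sketch below is illustrative only: the class and field names (KnowledgeSample, premise, hypothesis, knowledge_point) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List

# Label convention from the definitions above: 0 = entailment, 1 = contradiction, 2 = neutral.
@dataclass
class KnowledgeSample:
    premise: str          # precondition text: a knowledge point or a related fact description
    hypothesis: str       # a person's understanding of that knowledge point
    label: int            # 0 / 1 / 2
    knowledge_point: str  # identifier of the professional knowledge point being described

# Each task in the continual-learning sequence carries its own data set.
TaskDataset = List[KnowledgeSample]
TaskSequence = List[TaskDataset]   # tasks are ordered in time; the last one is the current task
```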
The detailed technical scheme of the invention is as follows:
a sample selection method for continuous learning of a text knowledge inference model is characterized by comprising the following steps: selecting a historical task sample and selecting a current task sample;
(1) Wherein the selection of the historical task sample comprises the following steps:
obtaining central vectors and selecting samples, wherein the central vectors refer to the central vectors of all samples that describe the same knowledge point, and sample selection means choosing suitable samples to add to the memory sample set M;
when obtaining the central vectors:
for samples of the form (p, h, y), formulas (I) and (II) are used to calculate the surface-feature central vector c_sur^j and the implicit-feature central vector c_imp^j, where D_t^j denotes all samples in the data set D_t that describe the j-th knowledge point k_j, and f_sur(·) and f_imp(·) are the functions that extract the surface features and the implicit features of a text respectively (each central vector is the mean of the corresponding feature vectors):
c_sur^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_sur(x)   (I)
c_imp^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_imp(x)   (II)
in formulas (I) and (II), c_sur^j and c_imp^j denote the surface-feature central vector and the implicit-feature central vector of knowledge point k_j;
obtaining the surface features (explicit features) and implicit features (latent features) of a text: the surface features are expressed by term frequency and inverse document frequency (TF-IDF) and denoted f_sur(text); the implicit features are expressed by the Sentence-BERT vector and denoted f_imp(text), where text is the text to be encoded;
when selecting the samples, the process comprises:
(1-1) determining the number of samples to be added to the memory set:
formulas (III) and (IV) determine how many samples describing the same knowledge point are selected into the memory set; according to the task order, the current task is recorded as the t-th task, and the number of samples selected from the data set of the i-th historical task is m_i, given by formula (III), where N is the total number of samples to be selected for model training, i.e. the sum of the memory-set sample size and the number of samples selected from the current task;
formula (IV) determines the number m_{i,j} of samples selected from the samples related to the j-th knowledge point in the data set of the i-th task, so that the per-knowledge-point distribution of the extracted samples is consistent with that of the original data set:
m_{i,j} = m_i * |D_i^j| / |D_i|   (IV)
where D_i denotes the i-th data set and D_i^j denotes the subset of D_i that describes the j-th knowledge point;
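A minimal sketch of the quota computation described for formulas (III) and (IV). The exact expression of formula (III) is not reproduced in the source, so the even split over tasks is an assumption; the proportional split over knowledge points follows the stated goal of keeping the selected distribution consistent with the original data set. All function names are illustrative.

```python
from collections import Counter
from typing import Dict, List

def per_task_quota(total_budget: int, current_task_index: int) -> int:
    # Assumed reading of formula (III): split the total training budget N evenly over the
    # current task and the t-1 historical tasks (tasks indexed from 1 here).
    return total_budget // current_task_index

def per_knowledge_point_quota(knowledge_points: List[str], task_quota: int) -> Dict[str, int]:
    # Formula (IV): allocate a task's quota to its knowledge points in proportion to their
    # share of the task's data set, so the selected distribution tracks the original one.
    counts = Counter(knowledge_points)
    total = len(knowledge_points)
    return {k: max(1, round(task_quota * c / total)) for k, c in counts.items()}

# Example: 300-sample budget, current task is the 3rd task -> 100 samples per task,
# split over that task's knowledge points by frequency.
kps = ["kp_a"] * 50 + ["kp_b"] * 30 + ["kp_c"] * 20
print(per_task_quota(300, 3))                 # -> 100
print(per_knowledge_point_quota(kps, 100))    # -> {'kp_a': 50, 'kp_b': 30, 'kp_c': 20}
```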
(1-2) selecting samples:
the memory set is selected by measuring samples with the representativeness, difference and balance indexes; during selection, the samples are traversed with one of the following two schemes;
scheme (1): traverse in ascending order of the distance between the sample vector and the central vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly with equal probability;
if a traversed sample satisfies representativeness, difference and balance, it is added to the memory set;
otherwise the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
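The selection loop of step (1-2) can be sketched as follows for scheme (2) (random traversal with equal probability). The predicate functions is_representative, is_different and keeps_balance are hypothetical stand-ins for the representativeness, difference and balance tests defined later in the document.

```python
import random
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")

def select_memory_samples(
    candidates: Sequence[S],
    quota: int,
    is_representative: Callable[[S], bool],
    is_different: Callable[[S, List[S]], bool],
    keeps_balance: Callable[[S, List[S]], bool],
    seed: int = 0,
) -> List[S]:
    """Scheme (2): traverse candidates in random order and keep a sample only if it
    passes the representativeness, difference and balance tests; stop at the quota."""
    selected: List[S] = []
    order = list(candidates)
    random.Random(seed).shuffle(order)
    for sample in order:
        if len(selected) >= quota:
            break
        if (is_representative(sample)
                and is_different(sample, selected)
                and keeps_balance(sample, selected)):
            selected.append(sample)
    return selected
```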
(2) Wherein, the current task sample selection comprises:
sample representative analysis, sample difficulty analysis and sample sampling; the process is as follows;
in order to reduce the training cost, only a very small number of samples are expected to be selected for training on the current task; however, a small number of samples can hardly represent the characteristics of the whole sample population, so representative and difficult samples need to be extracted from the current-task data set to train the text knowledge inference model. Since the purpose of screening samples is to fine-tune the text knowledge inference model, on the one hand the representativeness of the whole sample population must be preserved as far as possible under the constraint of a limited sample budget, and on the other hand difficult samples should be selected because they carry more information that benefits the model; the method therefore combines the two indexes of representativeness and difficulty to screen samples on the current task;
(2-1) sample representativeness analysis in current-task sample selection, comprising:
for the candidate sample set of the j-th knowledge point in the data set D_t of the current task (the t-th task), clustering is performed on the set of hypothesis-text surface-feature vectors of the samples, with the number of clusters K specified in advance; K can be determined from the number of knowledge points of past tasks, i.e. a value similar to that number is chosen; according to formula (VI), K clusters C_1, ..., C_K are obtained; the number of samples to be extracted for the j-th knowledge point of the current task is m_{t,j}; for each cluster C_u, the variance of the samples in the cluster and the number of samples in the cluster are calculated to analyse sample representativeness: in a cluster with a large variance and a large number of samples, each individual sample is less representative of the cluster, so more samples need to be drawn from that cluster to maintain its representativeness; the number of samples drawn from cluster C_u is determined according to formula (V) and denoted m_{t,j,u}, i.e. the number of samples selected from the u-th cluster of the j-th knowledge point in the t-th task;
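A sketch of the representativeness analysis in (2-1), assuming k-means clustering on the hypothesis-text surface vectors (the clustering algorithm is not named explicitly in the source) and a size-times-variance weighting as one plausible reading of formula (V); both assumptions are marked in the comments.

```python
import numpy as np
from sklearn.cluster import KMeans

def allocate_per_cluster(surface_vectors: np.ndarray, n_clusters: int, quota: int, seed: int = 0):
    """Cluster the hypothesis-text surface vectors of one knowledge point and decide
    how many samples to draw from each cluster (assumed form of formula (V))."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(surface_vectors)
    labels = km.labels_
    weights = []
    for c in range(n_clusters):
        members = surface_vectors[labels == c]
        # Large, high-variance clusters are less well represented by any single sample,
        # so they receive a larger share of the quota (assumed weighting).
        variance = members.var(axis=0).sum() if len(members) > 1 else 0.0
        weights.append(len(members) * (variance + 1e-8))
    weights = np.asarray(weights)
    shares = weights / weights.sum()
    counts = np.maximum(1, np.round(shares * quota).astype(int))
    return labels, counts   # cluster label per sample, samples to draw per cluster
```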
(2-2) sample difficulty analysis in current-task sample selection, comprising:
the pre-trained professional text reasoning model f takes as input the premise text p and the hypothesis text h of a sample x and performs inference prediction, obtaining a predicted probability distribution over the class set {0, 1, 2}, where ŷ denotes the class label with the maximum probability; formula (VII) calculates, for sample x, the difference between the largest and the second-largest output probabilities in the predicted distribution, and uses this margin to measure the difficulty d(x) of the sample:
d(x) = P_f(ŷ | x) − max_{c ≠ ŷ} P_f(c | x)   (VII)
the smaller d(x) is, the less confident the reasoning model is in its prediction and the more difficult the sample is, where P_f(ŷ | x) denotes the probability with which the professional text reasoning model f predicts the sample class to be ŷ, and P_f(c | x) denotes the probability with which it predicts the sample class to be c;
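The difficulty measure of formula (VII) is simply the margin between the largest and the second-largest predicted class probabilities; a smaller margin means a less confident model and a harder sample. A minimal sketch:

```python
import numpy as np

def sample_difficulty(class_probs: np.ndarray) -> np.ndarray:
    """class_probs: (n_samples, n_classes) probability distribution predicted by the
    pre-trained inference model. Returns the margin of formula (VII): smaller = harder."""
    top2 = np.sort(class_probs, axis=1)[:, -2:]        # two largest probabilities per row
    return top2[:, 1] - top2[:, 0]                     # max minus second max

# Example: a confident prediction has a large margin, an ambiguous one a small margin.
probs = np.array([[0.90, 0.06, 0.04],
                  [0.40, 0.38, 0.22]])
print(sample_difficulty(probs))   # -> approximately [0.84, 0.02]; the second sample is harder
```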
(2-3) sample sampling in current-task sample selection, comprising:
(2-3-1) for the candidate sample set of each knowledge point in the current-task data set D_t, the sample representativeness analysis of step (2-1) is performed to obtain the number of samples m_{t,j,u} to draw from each cluster C_u, so as to maintain the representativeness of the screened sample set;
(2-3-2) the samples are then subjected to the sample difficulty analysis of step (2-2), and the difficulty value d(x) of each sample x is calculated; for each cluster C_u, the m_{t,j,u} most difficult samples are drawn from it, i.e. the samples with the smallest d(x) values are selected and added to the small sample set to be screened; every cluster goes through the above sampling process, which completes the sampling for knowledge point k_j; all knowledge points in the current task go through this process to finish the data screening on the current task, finally yielding the training sample set of the current task;
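Putting (2-1) and (2-2) together, the per-cluster sampling of step (2-3) can be sketched as below: within each cluster, keep the hardest samples (smallest margin) up to that cluster's quota. The function and argument names are illustrative.

```python
import numpy as np

def sample_current_task(margins: np.ndarray, cluster_labels: np.ndarray,
                        per_cluster_quota: np.ndarray) -> np.ndarray:
    """margins: difficulty values per sample (smaller = harder); cluster_labels: cluster id
    per sample; per_cluster_quota: samples to keep per cluster. Returns selected indices."""
    selected = []
    for c, quota in enumerate(per_cluster_quota):
        idx = np.where(cluster_labels == c)[0]
        if len(idx) == 0:
            continue
        hardest = idx[np.argsort(margins[idx])][:quota]   # ascending margin = hardest first
        selected.extend(hardest.tolist())
    return np.array(selected)
```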
According to the present invention, preferably, in step (1), the method for acquiring the surface features of a text specifically comprises:
segmenting the text with the coarse-grained tokenizer of HanLP, removing common Chinese stop words from the segmentation result, and filtering out words that appear only once to obtain the dictionary of all texts; finally, the TF-IDF feature vector of each text is calculated as the surface feature of the sample.
According to the present invention, preferably, in step (1), the specific method for acquiring the implicit features of a text comprises:
using SentenceTransformers, the concrete implementation of the Sentence-BERT model, loading the pre-trained model paraphrase-multilingual-mpnet-base-v2 to encode the text, and using the encoded Sentence-BERT feature vector to express the implicit features of the text.
Preferably, in step (1-2), the method for selecting the memory set by measuring samples with the representativeness, difference and balance indexes comprises:
for a series of tasks T_1, ..., T_t, each task has a corresponding data set, where T_t denotes the task currently to be learned and D_t its corresponding data set, and each sample in a data set is recorded as x = (p, h, y); when the text knowledge reasoning model learns on the current task, memory replay is performed with the memory set M = ∪_{i < t} M_i, where ∪ is the set-union operation whose range covers all tasks before the current one, M_i denotes the set of samples from data set D_i that are added to the memory set, and M is the abbreviation of the memory set; the set M must satisfy representativeness, difference and balance;
the representativeness, measured with the Local Outlier Factor (LOF): because individuals differ in professional knowledge level and expression, the samples describing the same knowledge point vary in quality and take diverse forms, so the samples added to the memory set should represent the different qualities and forms of all samples describing the corresponding knowledge point;
the difference: for samples x_i and x_j, the difference is measured with the distance between their surface and implicit feature vectors f_sur(·), f_imp(·); to achieve the robustness goal of the model, diverse samples, i.e. samples whose features differ from one another, are selected into the memory set;
the balance means that the memory set covers all knowledge points described in the original data set, and that the number of samples describing the same knowledge point in the memory set is balanced against the number of such samples in the original data set.
According to a preferred embodiment of the present invention, the specific method for measuring representativeness with the Local Outlier Factor (LOF) is:
formula (VIII) is used to calculate the distance d(a, b) between two vectors, where a and b are two surface feature vectors or two implicit feature vectors, referred to simply as vectors a and b;
formula (IX) is used to obtain the k-th reachable distance from vector a to vector b:
reach-dist_k(a, b) = max{ k-distance(b), d(a, b) }   (IX)
where k-distance(b) denotes, in the vector space, the distance from vector b to its k-th nearest vector;
formula (X) is used to calculate the local reachable density of a:
lrd_k(a) = 1 / ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} reach-dist_k(a, b) )   (X)
where N_k(a) is the set of all vectors within the k-th distance of vector a, and the k-th distance of vector a is the distance from a to its k-th nearest vector;
formula (XI) is used to calculate the local outlier factor of a:
LOF_k(a) = ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} lrd_k(b) ) / lrd_k(a)   (XI)
when LOF_k(a) is greater than 1, the local reachable density of the current vector is smaller than that of the surrounding vectors in the vector space, and the larger the value, the more the current sample is an outlier; when LOF_k(a) is less than or equal to 1, the local reachable density of the current vector is greater than that of the surrounding vectors, and the smaller the value, the more the current sample is aggregated;
the representativeness Rep(x) of a sample is obtained by combining the local outlier factors of the sample in the surface feature space and in the implicit feature space; the smaller the LOF values, the larger Rep(x) and the more representative the sample; LOF values lie in (0, +∞), and from the above discussion of values less than or equal to 1 and greater than 1, a larger LOF value means a greater degree of outlier and a more likely abnormal vector, while a smaller LOF value means a smaller degree of outlier and a more likely normal vector;
formula (XII) gives the calculation of Rep(x) for a sample x: it combines the two LOF values with an adjustable parameter α that indicates the relative importance of the surface features; if the surface features are more important, α is increased, and if the implicit features are more important, α is decreased, the default value being 0.5; the division in formula (XII) keeps Rep(x) within a similar range for different sample distributions.
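A sketch of the representativeness score, computing LOF with scikit-learn's LocalOutlierFactor in both feature spaces and combining them with the weight α. The reciprocal-based combination is an assumed form of formula (XII), chosen only so that smaller LOF values yield larger scores.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def representativeness(surface_vecs: np.ndarray, implicit_vecs: np.ndarray,
                       alpha: float = 0.5, n_neighbors: int = 20) -> np.ndarray:
    """Representativeness per formula (XII), assumed form: combine the local outlier
    factors computed in the surface and implicit feature spaces; smaller LOF -> larger score."""
    def lof(x: np.ndarray) -> np.ndarray:
        est = LocalOutlierFactor(n_neighbors=min(n_neighbors, len(x) - 1))
        est.fit(x)
        return -est.negative_outlier_factor_      # sklearn stores the negated LOF
    lof_sur, lof_imp = lof(surface_vecs), lof(implicit_vecs)
    return alpha / lof_sur + (1.0 - alpha) / lof_imp
```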
According to a preferred embodiment of the invention, the difference between samples x_i and x_j is measured with the L2 distance between their surface feature vectors and between their implicit feature vectors, as shown in formulas (XIII) and (XIV), where δ is the difference threshold, an adjustable parameter that defaults to the mean of the distances between all samples in the corresponding feature space:
|| f_sur(x_i) − f_sur(x_j) ||_2 > δ_sur   (XIII)
|| f_imp(x_i) − f_imp(x_j) ||_2 > δ_imp   (XIV)
a candidate sample is selected when it satisfies formulas (XIII) and (XIV); this step determines whether the candidate sample differs from the samples already in the memory set.
According to a preferred embodiment of the invention, the balance means that the feature distribution of the sample set M selected into the memory set approximates the feature distribution of the original data set D, i.e. for every knowledge point k in the knowledge point set K:
P_M(x ∈ X_k ; θ_M) ≈ P_D(x ∈ X_k ; θ_D)
where P_M(· ; θ_M) denotes the probability distribution of the memory-set samples with parameter θ_M (M is the abbreviation of the memory set and should be read as a whole), P_D(· ; θ_D) denotes the sample probability distribution of the original data set, θ_D and θ_M are respectively the parameters of the probability distribution of the original data set samples and of the memory-set samples (a probability distribution P is written P(· ; θ) together with the specific parameter θ that determines it), ∈ is the basic set-membership operation, x denotes a sample, x ∈ X_k denotes a sample describing knowledge point k, and K is the set of knowledge points; after a sample is added, whether the formula still holds is verified, and if it holds the balance definition is met.
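A sketch of the balance test: every knowledge point of the original data set must be covered, and the per-knowledge-point proportions of the memory set should stay close to those of the original data set. The tolerance-based comparison is an assumed reading of the distribution-approximation constraint.

```python
from collections import Counter
from typing import List

def keeps_balance(memory_kps: List[str], dataset_kps: List[str], tolerance: float = 0.1) -> bool:
    """Balance index: every knowledge point of the original data set is covered and the
    per-knowledge-point proportion in the memory set stays close to the original one."""
    mem, full = Counter(memory_kps), Counter(dataset_kps)
    if set(full) - set(mem):                      # some knowledge point not covered yet
        return False
    n_mem, n_full = len(memory_kps), len(dataset_kps)
    return all(abs(mem[k] / n_mem - full[k] / n_full) <= tolerance for k in full)
```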
In the above process, the hyperparameters involved can keep their default values or be customised by the user to meet actual service requirements. The two parts of the invention let the user choose the sample-selection strategy suited to their own situation: the user provides the sample sets, the temporal relationship between the sample sets, the strategy and the optional hyperparameters, and the invention automatically selects historical samples according to the established algorithm and adds them to the memory set for the subsequent training of the natural language inference model oriented to professional knowledge points.
An apparatus for implementing the sample selection method, characterized by comprising: a central vector calculation module, a sample selection module and a training module;
the central vector calculation module calculates the central vectors of the surface features and the implicit features according to the label information of the samples, for use in subsequent sample selection;
the sample selection module, using the chosen sample-selection strategy, selects suitable samples to add to the memory set according to the properties the samples must satisfy, finally obtaining the complete memory set;
the training module uses the complete memory set to assist the training of the current task.
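A schematic arrangement of the three modules; the class and method names are illustrative and not taken from the patent.

```python
class CenterVectorModule:
    def compute(self, samples):
        """Group samples by knowledge point label and return the mean surface-feature
        and implicit-feature vectors of each group (formulas (I)/(II))."""
        ...

class SampleSelectionModule:
    def build_memory_set(self, historical_tasks, strategy, quota):
        """Traverse candidates with the chosen strategy and keep those that satisfy the
        representativeness, difference and balance indexes until the quota is reached."""
        ...

class TrainingModule:
    def train(self, model, current_task_samples, memory_set):
        """Fine-tune the inference model on the selected current-task samples together
        with the replayed memory set."""
        ...
```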
A computer device for implementing the sample selection method, characterized by comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following process:
in the use stage, the user provides the tasks, the precedence relationship between the tasks, the strategy and optional hyperparameters; historical samples are then selected according to the sample selection method and added to the memory set for subsequent professional text reasoning model training, i.e. the training of the natural language inference model oriented to professional knowledge points.
The technical advantages of the invention are as follows:
The continual-learning sample selection method oriented to the text knowledge inference model takes into account sample properties such as representativeness, balance and difference; compared with prior-art methods that select representative samples based on cluster centres, it adapts better to complex text reasoning scenarios, effectively uses a small number of samples to approximate the distribution of the original samples, and enables the model to retain the knowledge learned on historical tasks.
The method fine-tunes the text knowledge inference model according to the properties of the high-quality memory-set samples and of the samples screened for the practical problem at hand; it better helps the model remember previous tasks while also helping the model train on the current task, effectively increasing the robustness of the model in practical use.
In addition, the method can also be used for other similar tasks in the natural language processing field, such as continual-learning tasks based on memory replay, e.g. knowledge question answering and text classification.
Drawings
FIG. 1 is a diagram illustrating the relationship between tasks and data sets in text-oriented reasoning in the present invention;
FIG. 2 is a flow chart of a continuous learning history sample selection method for text knowledge reasoning according to the present invention;
FIG. 3 is a schematic diagram of a sample selection flow of the method for selecting a continuous learning history sample oriented to text knowledge inference according to the present invention;
FIG. 4 is a flow chart of the method for continuously learning current sample selection based on text knowledge inference according to the present invention.
Detailed Description
The invention is described in detail below with reference to the following examples and the accompanying drawings of the specification, but is not limited thereto.
Embodiment 1
As shown in FIG. 1 and FIG. 2, a sample selection method for continuous learning of a text knowledge inference model comprises: historical task sample selection and current task sample selection;
(1) Wherein the historical task sample selection comprises the following steps:
obtaining central vectors and selecting samples, as shown in FIG. 2, wherein the central vectors refer to the central vectors of all samples describing the same knowledge point, and sample selection adds the selected samples to the memory sample set M;
when obtaining the central vectors:
for samples of the form (p, h, y), formulas (I) and (II) are used to calculate the surface-feature central vector c_sur^j and the implicit-feature central vector c_imp^j, where D_t^j denotes all samples in the data set D_t that describe the j-th knowledge point k_j, and f_sur(·) and f_imp(·) are the functions that extract the surface features and the implicit features of a text respectively:
c_sur^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_sur(x)   (I)
c_imp^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_imp(x)   (II)
in formulas (I) and (II), c_sur^j and c_imp^j denote the surface-feature central vector and the implicit-feature central vector of knowledge point k_j;
obtaining the surface features (explicit features) and implicit features (latent features) of a text: the surface features are expressed by term frequency and inverse document frequency (TF-IDF) and denoted f_sur(text); the implicit features are expressed by the Sentence-BERT vector and denoted f_imp(text), where text is the text to be encoded;
when the sample is selected, the process is shown in fig. 3, and includes:
(1-1) determining the number of samples selected and added into a memory set:
the formulas (III) - (IV) are to determine how many samples describing the same knowledge point are selected and added into the memory set, and to remember the current task as the first task according to the sequence of the tasks
Figure SMS_190
Individual task, history
Figure SMS_191
The amount of samples selected in the data set of each task is
Figure SMS_192
As in formula (III), wherein
Figure SMS_193
The total amount of samples to be selected for model training, namely the sum of the memory set sample size and the current task selection sample size:
Figure SMS_194
determined by the formula (IV)
Figure SMS_195
First of a data set of a task
Figure SMS_196
The number of samples selected from the related samples of each knowledge point is
Figure SMS_197
So that the number distribution of the samples of each knowledge point extracted is consistent with that of the original data set:
Figure SMS_198
wherein
Figure SMS_199
Denotes the first
Figure SMS_200
A data set;
Figure SMS_201
is shown as
Figure SMS_202
Described in the individual data set
Figure SMS_203
A data set of individual knowledge points;
(1-2) selecting samples:
the memory set is selected by measuring samples with the representativeness, difference and balance indexes; during selection, the samples are traversed with one of the following two schemes;
scheme (1): traverse in ascending order of the distance between the sample vector and the central vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly with equal probability;
if a traversed sample satisfies representativeness, difference and balance, it is added to the memory set;
otherwise the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
(2) Wherein, the current task sample selection comprises:
sample representative analysis, sample difficulty analysis and sample sampling; the flow is shown in FIG. 4;
in order to reduce the training cost, only a very small number of samples are expected to be selected for training on the current task; however, a small number of samples can hardly represent the characteristics of the whole sample population, so representative and difficult samples need to be extracted from the current-task data set to train the text knowledge inference model. Since the purpose of screening samples is to fine-tune the text knowledge inference model, on the one hand the representativeness of the whole sample population must be preserved as far as possible under the constraint of a limited sample budget, and on the other hand difficult samples should be selected because they carry more information that benefits the model; the method therefore combines the two indexes of representativeness and difficulty to screen samples on the current task;
(2-1) sample representativeness analysis in current-task sample selection, comprising:
for the candidate sample set of the j-th knowledge point in the data set D_t of the current task (the t-th task), clustering is performed on the set of hypothesis-text surface-feature vectors of the samples, with the number of clusters K specified in advance; K can be determined from the number of knowledge points of past tasks, i.e. a value similar to that number is chosen; according to formula (VI), K clusters C_1, ..., C_K are obtained; the number of samples to be extracted for the j-th knowledge point of the current task is m_{t,j}; for each cluster C_u, the variance of the samples in the cluster and the number of samples in the cluster are calculated to analyse sample representativeness: in a cluster with a large variance and a large number of samples, each individual sample is less representative of the cluster, so more samples need to be drawn from that cluster to maintain its representativeness; the number of samples drawn from cluster C_u is determined according to formula (V) and denoted m_{t,j,u}, i.e. the number of samples selected from the u-th cluster of the j-th knowledge point in the t-th task;
(2-2) sample difficulty analysis in current-task sample selection, comprising:
the pre-trained professional text reasoning model f takes as input the premise text p and the hypothesis text h of a sample x and performs inference prediction, obtaining a predicted probability distribution over the class set {0, 1, 2}, where ŷ denotes the class label with the maximum probability; formula (VII) calculates, for sample x, the difference between the largest and the second-largest output probabilities in the predicted distribution, and uses this margin to measure the difficulty d(x) of the sample:
d(x) = P_f(ŷ | x) − max_{c ≠ ŷ} P_f(c | x)   (VII)
the smaller d(x) is, the less confident the reasoning model is in its prediction and the more difficult the sample is, where P_f(ŷ | x) denotes the probability with which the professional text reasoning model f predicts the sample class to be ŷ, and P_f(c | x) denotes the probability with which it predicts the sample class to be c;
(2-3) sample sampling in current-task sample selection, comprising:
(2-3-1) for the candidate sample set of each knowledge point in the current-task data set D_t, the sample representativeness analysis of step (2-1) is performed to obtain the number of samples m_{t,j,u} to draw from each cluster C_u, so as to maintain the representativeness of the screened sample set;
(2-3-2) the samples are then subjected to the sample difficulty analysis of step (2-2), and the difficulty value d(x) of each sample x is calculated; for each cluster C_u, the m_{t,j,u} most difficult samples are drawn from it, i.e. the samples with the smallest d(x) values are selected and added to the small sample set to be screened; every cluster goes through the above sampling process, which completes the sampling for knowledge point k_j; all knowledge points in the current task go through this process to finish the data screening on the current task, finally yielding the training sample set of the current task.
Embodiment 2
In the sample selection method for continuous learning of the text knowledge inference model described in Embodiment 1, in step (1), the specific method for acquiring the surface features of a text is:
segmenting the text with the coarse-grained tokenizer of HanLP, removing common Chinese stop words from the segmentation result, and filtering out words that appear only once to obtain the dictionary of all texts; finally, the TF-IDF feature vector of each text is calculated as the surface feature of the sample.
In step (1), the specific method for acquiring the implicit features of a text is:
using SentenceTransformers, the concrete implementation of the Sentence-BERT model, loading the pre-trained model paraphrase-multilingual-mpnet-base-v2 to encode the text, and using the encoded Sentence-BERT feature vector to express the implicit features of the text.
Embodiment 3
In the sample selection method for continuous learning of the text knowledge inference model described in Embodiments 1 and 2, in step (1-2), the method for selecting the memory set by measuring samples with the representativeness, difference and balance indexes comprises:
for a series of tasks T_1, ..., T_t, each task has a corresponding data set, where T_t denotes the task currently to be learned and D_t its corresponding data set, and each sample in a data set is recorded as x = (p, h, y); when the text knowledge reasoning model learns on the current task, memory replay is performed with the memory set M = ∪_{i < t} M_i, where ∪ is the set-union operation whose range covers all tasks before the current one, M_i denotes the set of samples from data set D_i that are added to the memory set, and M is the abbreviation of the memory set; the set M must satisfy representativeness, difference and balance;
the representativeness, measured with the Local Outlier Factor (LOF): because individuals differ in professional knowledge level and expression, the samples describing the same knowledge point vary in quality and take diverse forms, so the samples added to the memory set should represent the different qualities and forms of all samples describing the corresponding knowledge point;
the difference: for samples x_i and x_j, the difference is measured with the distance between their surface and implicit feature vectors f_sur(·), f_imp(·); to achieve the robustness goal of the model, diverse samples, i.e. samples whose features differ from one another, are selected into the memory set;
the balance means that the memory set covers all knowledge points described in the original data set, and that the number of samples describing the same knowledge point in the memory set is balanced against the number of such samples in the original data set.
The specific method for measuring representativeness with the Local Outlier Factor (LOF) is:
formula (VIII) is used to calculate the distance d(a, b) between two vectors, where a and b are two surface feature vectors or two implicit feature vectors, referred to simply as vectors a and b;
formula (IX) is used to obtain the k-th reachable distance from vector a to vector b:
reach-dist_k(a, b) = max{ k-distance(b), d(a, b) }   (IX)
where k-distance(b) denotes, in the vector space, the distance from vector b to its k-th nearest vector;
formula (X) is used to calculate the local reachable density of a:
lrd_k(a) = 1 / ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} reach-dist_k(a, b) )   (X)
where N_k(a) is the set of all vectors within the k-th distance of vector a, and the k-th distance of vector a is the distance from a to its k-th nearest vector;
formula (XI) is used to calculate the local outlier factor of a:
LOF_k(a) = ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} lrd_k(b) ) / lrd_k(a)   (XI)
when LOF_k(a) is greater than 1, the local reachable density of the current vector is smaller than that of the surrounding vectors in the vector space, and the larger the value, the more the current sample is an outlier; when LOF_k(a) is less than or equal to 1, the local reachable density of the current vector is greater than that of the surrounding vectors, and the smaller the value, the more the current sample is aggregated;
the representativeness Rep(x) of a sample is obtained by combining the local outlier factors of the sample in the surface feature space and in the implicit feature space; the smaller the LOF values, the larger Rep(x) and the more representative the sample; LOF values lie in (0, +∞), and from the above discussion of values less than or equal to 1 and greater than 1, a larger LOF value means a greater degree of outlier and a more likely abnormal vector, while a smaller LOF value means a smaller degree of outlier and a more likely normal vector;
formula (XII) gives the calculation of Rep(x) for a sample x: it combines the two LOF values with an adjustable parameter α that indicates the relative importance of the surface features; if the surface features are more important, α is increased, and if the implicit features are more important, α is decreased, the default value being 0.5; the division in formula (XII) keeps Rep(x) within a similar range for different sample distributions.
The difference between samples x_i and x_j is measured with the L2 distance between their surface feature vectors and between their implicit feature vectors, as shown in formulas (XIII) and (XIV), where δ is the difference threshold, an adjustable parameter that defaults to the mean of the distances between all samples in the corresponding feature space:
|| f_sur(x_i) − f_sur(x_j) ||_2 > δ_sur   (XIII)
|| f_imp(x_i) − f_imp(x_j) ||_2 > δ_imp   (XIV)
a candidate sample is selected when it satisfies formulas (XIII) and (XIV); this step determines whether the candidate sample differs from the samples already in the memory set.
The balance means that the feature distribution of the sample set M selected into the memory set approximates the feature distribution of the original data set D, i.e. for every knowledge point k in the knowledge point set K:
P_M(x ∈ X_k ; θ_M) ≈ P_D(x ∈ X_k ; θ_D)
where P_M(· ; θ_M) denotes the probability distribution of the memory-set samples with parameter θ_M (M is the abbreviation of the memory set and should be read as a whole), P_D(· ; θ_D) denotes the sample probability distribution of the original data set, θ_D and θ_M are respectively the parameters of the probability distribution of the original data set samples and of the memory-set samples (a probability distribution P is written P(· ; θ) together with the specific parameter θ that determines it), ∈ is the basic set-membership operation, x denotes a sample, x ∈ X_k denotes a sample describing knowledge point k, and K is the set of knowledge points; after a sample is added, whether the formula still holds is verified, and if it holds the balance definition is met.
In the above process, the hyperparameters involved can keep their default values or be customised by the user to meet actual service requirements. The two parts of the invention let the user choose the sample-selection strategy suited to their own situation: the user provides the sample sets, the temporal relationship between the sample sets, the strategy and the optional hyperparameters, and the invention automatically selects historical samples according to the established algorithm and adds them to the memory set for the subsequent training of the natural language inference model oriented to professional knowledge points.
Embodiment 5
An apparatus for implementing the sample selection method according to Embodiments 1-4, comprising: a central vector calculation module, a sample selection module and a training module;
the central vector calculation module calculates the central vectors of the surface features and the implicit features according to the label information of the samples, for use in subsequent sample selection;
the sample selection module, using the chosen sample-selection strategy, selects suitable samples to add to the memory set according to the properties the samples must satisfy, finally obtaining the complete memory set;
the training module uses the complete memory set to assist the training of the current task.
Embodiment 6
A computer device for implementing the sample selection method of Embodiments 1-4, comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following process:
in the use stage, the user provides the tasks, the precedence relationship between the tasks, the strategy and optional hyperparameters; historical samples are then selected according to the sample selection method and added to the memory set for subsequent professional text reasoning model training, i.e. the training of the natural language inference model oriented to professional knowledge points.

Claims (10)

1. A sample selection method for continuous learning of a text knowledge inference model is characterized by comprising the following steps: selecting a historical task sample and a current task sample;
(1) Wherein the selection of the historical task sample comprises the following steps: obtaining a central vector and selecting a sample;
when obtaining the central vectors: for samples of the form (p, h, y), formulas (I) and (II) are used to calculate the surface-feature central vector c_sur^j and the implicit-feature central vector c_imp^j, where D_t^j denotes all samples in the data set D_t that describe the j-th knowledge point k_j, and f_sur(·) and f_imp(·) are the functions that extract the surface features and the implicit features of a text respectively:
c_sur^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_sur(x)   (I)
c_imp^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_imp(x)   (II)
in formulas (I) and (II), c_sur^j and c_imp^j denote the surface-feature central vector and the implicit-feature central vector of knowledge point k_j;
obtaining the surface features and implicit features of the text: the surface features are expressed by term frequency and inverse document frequency and denoted f_sur(text); the implicit features are expressed by the Sentence-BERT vector and denoted f_imp(text), where text is the text to be encoded;
when selecting the samples, the method comprises:
(1-1) determining the number of samples to be added to the memory set;
(1-2) selecting samples: the memory set is selected by measuring samples with the representativeness, difference and balance indexes, and during selection the samples are traversed with one of the following two schemes,
scheme (1): traverse in ascending order of the distance between the sample vector and the central vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly with equal probability;
if a traversed sample satisfies representativeness, difference and balance, it is added to the memory set;
otherwise the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
(2) the current task sample selection comprises: sample representativeness analysis, sample difficulty analysis and sample sampling;
(2-1) performing sample representativeness analysis in the current task sample selection;
(2-2) performing sample difficulty analysis in the current task sample selection;
(2-3) performing sample sampling in the current task sample selection.
2. The sample selection method for continual learning of a text knowledge inference model according to claim 1, characterized in that in step (1):
(1-1) determining the number of samples to be selected and added to the memory set:
according to the order of the tasks, the current task is recorded as the t-th task; the number of samples selected from the data set of each of the t-1 historical tasks is given by formula (III), in which the total number of samples used for model training is the sum of the memory-set sample size and the number of samples selected from the current task [formula image not reproduced];
the number of samples selected from the samples of each knowledge point in the data set of the i-th task is determined by formula (IV), so that the distribution of the number of extracted samples over the knowledge points is consistent with that of the original data set [formula image not reproduced]; in the formula, one symbol denotes the i-th data set and another denotes the samples of that data set describing a given knowledge point;
in step (2), (2-1) the sample representativeness analysis in current task sample selection comprises:
for the samples in the candidate sample set of a knowledge point of the current task data set, the surface feature vectors of the hypothesis texts are clustered, with the number of clusters set according to formula (VI), yielding a set of clusters [formula image not reproduced]; given the number of samples to be extracted for this knowledge point in the current task, for each cluster the variance of the samples in the cluster and the number of samples in the cluster are computed to analyze sample representativeness: the number of samples to be drawn from each cluster, i.e. the number of samples selected from that cluster for this knowledge point in the t-th task, is determined according to formula (V) [formula image not reproduced];
(2-2) the sample difficulty analysis in current task sample selection comprises:
the premise text and the hypothesis text of a sample are input into the pre-trained professional text inference model, which makes an inference prediction and outputs a probability distribution over the class set; the class label with the maximum predicted probability is taken; formula (VII) computes, for the sample, the difference between the largest and the second-largest output probabilities in the predicted distribution, and this margin is used to measure the difficulty of the sample [formula image not reproduced]; in the formula, one term is the probability with which the model predicts the sample's class to be the most probable label, and the other is the probability with which the model predicts the sample's class to be a given class c;
(2-3) the sample sampling in current task sample selection comprises:
(2-3-1) performing the sample representativeness analysis of step (2-1) on the candidate sample set of each knowledge point of the current task data set, obtaining the number of samples to be drawn from each cluster, so as to keep the screened sample set representative;
(2-3-2) then performing the sample difficulty analysis of step (2-2) and computing the difficulty value of each sample; from each cluster, the required number of most difficult samples is drawn, i.e. samples are selected in ascending order of the margin value (a smaller margin means a harder sample) and added to the small sample set to be screened; after every cluster has gone through this sampling, the sampling for that knowledge point is complete; all knowledge points of the current task go through the same process to complete the data screening on the current task, finally yielding the current-task training sample set of the required size (a sketch of this per-knowledge-point procedure is given after this claim).
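A sketch of steps (2-1)-(2-3) for a single knowledge point, assuming k-means over the hypothesis-text TF-IDF vectors; because formulas (V) and (VI) are not reproduced here, the per-cluster allocation below is simply proportional to cluster size and the cluster count is passed in, which are assumptions; all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def margin_difficulty(probs):
    """Step (2-2): difference between the largest and second-largest
    predicted probabilities; a smaller margin means a harder sample."""
    top2 = np.sort(np.asarray(probs))[-2:]
    return top2[1] - top2[0]

def select_for_knowledge_point(hyp_tfidf, pred_probs, n_per_point, n_clusters):
    """Steps (2-1)-(2-3) for one knowledge point of the current task."""
    n_total = hyp_tfidf.shape[0]
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(hyp_tfidf)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        # assumed allocation: proportional to cluster size (the patent's
        # formula (V) also involves the in-cluster variance, not shown here)
        n_c = max(1, round(n_per_point * len(idx) / n_total))
        margins = np.array([margin_difficulty(pred_probs[i]) for i in idx])
        # hardest samples first: ascending margin
        chosen.extend(idx[np.argsort(margins)[:n_c]].tolist())
    return chosen
```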
3. The sample selection method for continual learning of a text knowledge inference model according to claim 1, wherein in step (1) the surface features of a text are acquired as follows:
the text is segmented with the coarse-grained tokenizer of HanLP; common Chinese stop words are removed from the segmentation result and words that occur only once are filtered out, which yields the dictionary of all texts; finally, the TF-IDF feature vector of each text is computed as the surface feature of the sample.
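A minimal sketch of this step, assuming scikit-learn's TfidfVectorizer; the tokenizer is injected as a callable (e.g. a wrapper around HanLP's coarse-grained segmenter) rather than pinning a particular HanLP API version, and the stop-word list is supplied by the caller:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def surface_features(texts, tokenize, stop_words):
    """TF-IDF surface features as described in claim 3.

    tokenize   : callable returning a token list for a text
    stop_words : set of common Chinese stop words to remove
    """
    token_lists = [[w for w in tokenize(t) if w not in stop_words] for t in texts]
    # drop words that occur only once across the whole corpus
    counts = Counter(w for toks in token_lists for w in toks)
    vocab = sorted(w for w, c in counts.items() if c > 1)
    docs = [" ".join(w for w in toks if w in vocab) for toks in token_lists]
    vectorizer = TfidfVectorizer(vocabulary=vocab, analyzer=str.split)
    return vectorizer.fit_transform(docs), vectorizer
```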
4. The sample selection method for continual learning of a text knowledge inference model according to claim 1, wherein in step (1) the implicit features of a text are acquired as follows:
the Sentence-BERT model is used through its sentence-transformers implementation; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the resulting Sentence-BERT feature vector expresses the implicit features of the text.
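A minimal sketch, assuming the sentence-transformers package; the encoding options shown are library defaults, not requirements stated in the patent:

```python
from sentence_transformers import SentenceTransformer

# Load the checkpoint named in claim 4; it encodes text into a fixed-size
# Sentence-BERT vector used as the implicit feature of the sample.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def implicit_features(texts):
    """Return one Sentence-BERT vector per input text."""
    return model.encode(texts, convert_to_numpy=True)
```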
5. The sample selection method for continual learning of a text knowledge inference model according to claim 1, wherein in step (1-2) the samples are measured by the indexes of representativeness, difference and balance to select the memory set as follows:
representativeness is measured with the local outlier factor;
difference is measured, for a sample, by the L2 distances of its surface feature vector and of its implicit feature vector;
balance means that all knowledge points described in the original data set are covered in the memory set, and that the number of samples describing the same knowledge point is balanced between the original data set and the memory set.
6. The sample selection method for continual learning of a text knowledge inference model according to claim 5, wherein representativeness is measured with the local outlier factor as follows:
formula (VIII) computes the distance between two vectors, both of which are surface feature vectors or both implicit feature vectors, referred to here as vector p and vector o [formula image not reproduced];
formula (IX) gives the k-reachable distance from vector p to vector o, where the k-distance of o is the distance, in the vector space, from vector o to its k-th nearest vector [formula image not reproduced];
formula (X) computes the local reachable density of vector p, where the k-distance neighborhood of p is the set of all vectors within the k-distance of p, and the k-distance of p is the distance from vector p to its k-th nearest vector [formula image not reproduced];
formula (XI) computes the local outlier factor (LOF) of vector p [formula image not reproduced];
when the LOF value of a vector is greater than 1, the local reachable density of that vector is smaller than the local reachable density of the surrounding vectors in the vector space, and the larger the value, the more the current sample is an outlier;
when the LOF value of a vector is less than or equal to 1, the local reachable density of that vector is greater than the local reachable density of the surrounding vectors, and the smaller the value, the more the current sample is aggregated;
the representativeness of a sample is obtained by combining its local outlier factors in the surface feature space and the implicit feature space: the smaller the LOF values, the larger the representativeness;
formula (XII) gives the calculation of the representativeness of a sample, in which an adjustable parameter indicates the relative importance of the surface features [formula image not reproduced].
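A sketch of this representativeness score using scikit-learn's LocalOutlierFactor; since formula (XII) is not reproduced here, the weighted combination and the reciprocal mapping from LOF to representativeness are assumptions that only preserve the stated monotonicity (smaller LOF, larger representativeness):

```python
from sklearn.neighbors import LocalOutlierFactor

def representativeness(surf_vecs, impl_vecs, lam=0.5, k=20):
    """Combine each sample's LOF in the surface (TF-IDF) space and the
    implicit (Sentence-BERT) space into a single representativeness score.

    lam : assumed weight in [0, 1] for the surface-feature LOF
    k   : neighborhood size used by the local outlier factor
    """
    # scikit-learn stores -LOF in negative_outlier_factor_; flip the sign
    lof_surf = -LocalOutlierFactor(n_neighbors=k).fit(surf_vecs).negative_outlier_factor_
    lof_impl = -LocalOutlierFactor(n_neighbors=k).fit(impl_vecs).negative_outlier_factor_
    combined = lam * lof_surf + (1.0 - lam) * lof_impl
    return 1.0 / combined  # smaller LOF -> denser region -> more representative
```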
7. The sample selection method for continual learning of a text knowledge inference model according to claim 5, wherein the difference between a candidate sample and the samples already selected is measured with the L2 distances of their surface feature vectors and of their implicit feature vectors, as in formulas (XIII) and (XIV), in which a difference threshold is given [formula images not reproduced]; a candidate sample enters the memory set when it satisfies formulas (XIII) and (XIV).
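A sketch of this check; because formulas (XIII) and (XIV) are not reproduced here, requiring the minimum L2 distance to every memory-set sample to exceed a threshold in each feature space is an assumption, and the threshold parameters are illustrative:

```python
import numpy as np

def is_different(surf_vec, impl_vec, memory_surf, memory_impl, thr_surf, thr_impl):
    """Difference check: the candidate must be far enough (L2) from every
    sample already in the memory set, in both feature spaces."""
    if len(memory_surf) == 0:
        return True
    d_surf = np.linalg.norm(np.asarray(memory_surf) - surf_vec, axis=1).min()
    d_impl = np.linalg.norm(np.asarray(memory_impl) - impl_vec, axis=1).min()
    return d_surf > thr_surf and d_impl > thr_impl
```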
8. The sample selection method for continual learning of a text knowledge inference model according to claim 5, wherein balance means that the feature distribution of the sample set selected into the memory set approximates the feature distribution of the original data set [formula image not reproduced], where:
one term is the probability distribution of the memory-set samples, parameterized by the memory-set distribution parameter; the parameterized expression, together with its abbreviated form, is to be read as a whole;
another term is the sample probability distribution of the original data set, parameterized in the same way by the original-data-set distribution parameter;
the two parameters are, respectively, those of the original-data-set sample probability distribution and of the memory-set sample probability distribution; for a probability distribution, the specific parameter that determines it is attached to it in this notation;
the membership symbol is the basic set-membership operation; one symbol denotes a sample, another denotes a sample describing a given knowledge point, and the last denotes the set of knowledge points.
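A sketch of a balance check consistent with claims 5 and 8; the patent's own distribution-approximation formula is not reproduced, so comparing the per-knowledge-point sample shares of the memory set and the original data set against a tolerance is an assumption, and the parameter names are illustrative:

```python
import numpy as np
from collections import Counter

def is_balanced(candidate_point, memory_points, original_points, tol=0.1):
    """Accept the candidate only if, after adding it, the per-knowledge-point
    share of the memory set stays close to the share in the original data set."""
    points = sorted(set(original_points))
    orig_counts = Counter(original_points)
    mem_counts = Counter(memory_points + [candidate_point])
    orig = np.array([orig_counts[p] for p in points], dtype=float)
    mem = np.array([mem_counts[p] for p in points], dtype=float)
    return np.abs(orig / orig.sum() - mem / mem.sum()).max() <= tol
```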
9. An apparatus for implementing the sample selection method according to any one of claims 1 to 8, comprising: a center vector calculation module, a sample selection module and a training module;
the center vector calculation module is used to calculate the center vectors of the surface features and the implicit features according to the label information of the samples;
the sample selection module selects suitable samples and adds them to the memory set, finally obtaining the complete memory set;
the training module is used to assist the training of the current task with the complete memory set.
10. A computer device implementing the sample selection method according to any one of claims 1 to 8, characterized by comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following process:
in the use stage, the user specifies the tasks, the precedence relationship among the tasks, the strategy and optional hyperparameters; historical samples are selected according to the sample selection method and added to the memory set for subsequent training of the professional text inference model, that is, training of the natural language inference model oriented to professional knowledge points.
CN202310107542.8A 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning Active CN115829036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310107542.8A CN115829036B (en) 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning


Publications (2)

Publication Number Publication Date
CN115829036A true CN115829036A (en) 2023-03-21
CN115829036B CN115829036B (en) 2023-05-05

Family

ID=85521149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310107542.8A Active CN115829036B (en) 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning

Country Status (1)

Country Link
CN (1) CN115829036B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
US20150278200A1 (en) * 2014-04-01 2015-10-01 Microsoft Corporation Convolutional Latent Semantic Models and their Applications
CN112347268A (en) * 2020-11-06 2021-02-09 华中科技大学 Text-enhanced knowledge graph joint representation learning method and device
WO2021022572A1 (en) * 2019-08-07 2021-02-11 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN112966115A (en) * 2021-05-18 2021-06-15 东南大学 Active learning event extraction method based on memory loss prediction and delay training
WO2021184311A1 (en) * 2020-03-19 2021-09-23 中山大学 Method and apparatus for automatically generating inference questions and answers
CN115563315A (en) * 2022-12-01 2023-01-03 东南大学 Active complex relation extraction method for continuous few-sample learning
CN115618045A (en) * 2022-12-16 2023-01-17 华南理工大学 Visual question answering method, device and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王凌霄; 姜寅; 何联毅; 周凯: "Continuous-Mixture Autoregressive Networks Learning the Kosterlitz–Thouless Transition" *
胡正平; 高文涛; 万春艳: "Research on a controllable active learning algorithm combining sample uncertainty and representativeness" (基于样本不确定性和代表性相结合的可控主动学习算法研究) *

Also Published As

Publication number Publication date
CN115829036B (en) 2023-05-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant