CN115829036B - Sample selection method and device for text knowledge reasoning model continuous learning - Google Patents
- Publication number: CN115829036B
- Application number: CN202310107542.8A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D: climate change mitigation technologies in information and communication technologies)
Abstract
A sample selection method and device for continuous learning of a text knowledge reasoning model, belonging to the technical field of natural language reasoning, comprising historical task sample selection and current task sample selection. The historical task sample selection comprises: determining the number of samples selected to be added to the memory set; and selecting samples, namely selecting the memory set by measuring samples against representativeness, difference, and balance indexes, traversing the samples with one of two schemes. The current task sample selection comprises: sample representativeness analysis, sample difficulty analysis, and sample sampling. Compared with the prior art, which selects representative samples based on cluster centers, the method adapts better to complex text reasoning scenarios and effectively uses a small number of samples to approximate the distribution of the original samples, enabling the model to memorize the knowledge learned on historical tasks.
Description
Technical Field
The invention relates to a sample selection method and device for continuous learning of a text knowledge reasoning model, and belongs to the technical field of natural language reasoning.
Background
The natural language reasoning task is: given a premise text and a hypothesis text, judge whether the hypothesis text is correct, incorrect, or independent, taking the premise text as the standard. Text knowledge reasoning is a special form of the natural language reasoning task, in which the premise text is a knowledge point in a professional field, or a description of facts related to such a knowledge point, and the hypothesis text describes how different people understand that knowledge point. For example, in an economics examination, the premise text is professional knowledge or a related factual description in the field of economic law; the reference answer of a test question might read: "According to the rules of the company legal system, when the natural-person shareholders of a limited liability company change due to inheritance and other shareholders claim the right of first refusal, the people's court does not support the claim, unless the articles of association provide otherwise or all shareholders have agreed otherwise." The hypothesis text is the understanding of the knowledge point by different people, corresponding here to an examinee's answer, such as: "Qian requests to exercise the right of first refusal, and the people's court does not support it. The equity of a shareholder of the limited liability company is inherited by the shareholder's heir." In this example, the text knowledge reasoning task is to judge, according to the premise text (the reference answer), whether the hypothesis text (the examinee's answer) is correct. Text knowledge reasoning has important application value in fields such as subjective question grading, professional knowledge question answering, and knowledge reasoning.
In professional-knowledge text reasoning, the number of knowledge point categories is huge, knowledge points are described in many forms, the content and form of premise texts are continuously updated, and hypothesis texts vary widely because they are closely tied to each individual's professional knowledge level and expressive ability. As a result, samples describing the same knowledge point are highly confusable and hard to identify; rarely used knowledge points have few corresponding samples; and unpopular professional knowledge points lack labeled samples altogether. Facing continuously growing knowledge point sample data, especially knowledge points not covered by historical sample data, an intelligent model must address not only few-shot and noisy-sample challenges but also the continuous learning challenge, namely learning new knowledge points without forgetting existing knowledge, thereby improving the generalization ability and robustness of the model.
Continuous learning is introduced so that the text knowledge reasoning intelligent model can both handle new problems well and retain good performance on historical tasks. In the field of artificial intelligence, memory playback strategies are among the most effective continuous learning methods; see, for example, Wang, Hong, et al. "Sentence Embedding Alignment for Lifelong Relation Extraction", arXiv preprint arXiv:1903.02588, 2019.
Continuous learning is achieved by saving part of the samples of previous tasks to participate in subsequent training; the set formed by these samples is called the memory set, and the quality of the samples in the memory set determines the performance of the inference model on historical tasks.
For example, Chinese patent document CN114722892A provides a continuous learning method and device based on machine learning, in which a generator is trained on historical data and then used to generate a pseudo-sample set for each task as the memory set; with this method the quality of the generated samples is hard to guarantee, which affects the continuous learning effect.

Chinese patent document CN113688882A proposes a training method and device for a memory-enhanced continuous learning neural network model. Inspired by human brain memory playback, an expandable memory module is constructed with a simple data playback method that stores the mean and variance of the data, achieving a memory enhancement effect on the original task; however, this scheme considers only mode-representative samples of the dataset, ignoring sample difficulty and diversity.

Chinese patent document CN113590958A discloses a continuous learning method for a sequence recommendation model based on sample playback, which samples a small portion of representative samples according to an item-category balancing policy to generate the memory set; this approach does not consider sample difficulty and difference. In summary, existing work can hardly meet the continuous learning requirements of a text knowledge reasoning model.
To sum up, the problems of the prior art include: samples describing the same knowledge point vary in pattern and are uneven in quality; knowledge point categories are incompletely covered and the number of samples per knowledge point category is unbalanced; and, when selecting samples to add to the memory set, the selected samples describing the same knowledge point are highly repetitive in form or quality.
Disclosure of Invention
To address the defects of the prior art, the invention discloses a sample selection method for continuous learning of a text knowledge reasoning model.
The invention also discloses a device for realizing the sample selection method.
To address these problems, the invention proposes a representativeness index for the varied patterns and uneven quality of samples describing the same knowledge point; a balance index for the problems of knowledge point category coverage and the unbalanced number of samples per knowledge point category; and, to prevent high repetitiveness in the form or quality of memory-set samples describing the same knowledge point, a difference index. On this basis, multiple selection strategies and sample selection techniques that balance sample quality and sample feature distribution are provided, improving the model performance and robustness of continuous learning for professional-knowledge text reasoning, with theoretical significance for other text understanding tasks.
Interpretation of technical terms
1. Professional knowledge: texts describing theories, technologies, concepts, facts, and the like in professional fields such as finance, law, and accounting, as distinguished from general knowledge and common sense.
2. Professional knowledge point: the minimum constituent unit of professional knowledge, written in a normalized text description form, hereinafter referred to as a knowledge point.
3. Premise text: a domain knowledge point, or a description of facts related to a domain knowledge point, denoted p. The same knowledge point may be described by multiple premise texts.
4. Hypothesis text: a text describing how different people understand a professional knowledge point, denoted h. One premise text may have multiple corresponding hypothesis texts.
5. Task: in continuous model learning, learning proceeds over a series of tasks that have a temporal order, with the model learning on each task separately.
6. Dataset: each task has its own dataset; each sample of the dataset is a triple (p, h, y), where p is the premise text, h is the hypothesis text, and y ∈ {0, 1, 2}, with 0, 1, 2 respectively indicating that the sample label is entailment, contradiction, or neutral. The relationship of tasks to datasets is shown in FIG. 1.
7. Sentence-BERT model: the model described in Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084, 2019.
8. Sentence-Transformers: a code implementation of the Sentence-BERT model using the PyTorch framework for Python.
The detailed technical scheme of the invention is as follows:
a sample selection method for continuous learning of a text knowledge reasoning model is characterized by comprising the following steps: historical task sample selection and current task sample selection;
(1) The historical task sample selection comprises the following steps:
acquiring center vectors and selecting samples, wherein a center vector refers to the center vector of all samples describing the same knowledge point, and sample selection picks suitable samples to add to the memory sample set M;
when the center vectors are obtained:

for the dataset D_k of samples of the form (p, h, y) describing knowledge point k, the surface feature center vector c_k^sur and the implicit feature center vector c_k^imp are calculated using formulas (I) and (II), where f_sur(·) and f_imp(·) are, respectively, the functions that obtain the surface features and the implicit features of a text:

c_k^sur = (1/|D_k|) Σ_{x ∈ D_k} f_sur(x)    (I)

c_k^imp = (1/|D_k|) Σ_{x ∈ D_k} f_imp(x)    (II)

in formulas (I) and (II), one surface feature center vector and one implicit feature center vector are obtained for each knowledge point;

obtaining the surface features (lexical features) and implicit features (semantic features) of a text: the surface features are expressed using term frequency and inverse document frequency and are denoted f_sur(x); the implicit features are expressed using the Sentence-BERT vector of the encoded text and are denoted f_imp(x);
sample selection proceeds as follows:

(1-1) determining the number of samples selected to be added to the memory set:

formulas (III) and (IV) determine how many samples describing each knowledge point are selected into the memory set. Ordering the tasks, the current task is denoted the t-th task, and the sample size selected from the dataset of each of the t−1 historical tasks is B/t, as in formula (III), where B is the total sample size to be selected for model training, namely the sum of the memory set sample size and the sample size selected for the current task:

m_i = B / t,  i = 1, …, t−1    (III)

the number q_k of samples selected from the samples of the k-th knowledge point in the dataset of the i-th task is determined by formula (IV), such that the distribution of the number of samples extracted per knowledge point is consistent with the original dataset:

q_k = m_i · |D_{i,k}| / |D_i|    (IV)

where D_i denotes the i-th dataset and D_{i,k} denotes the dataset of samples describing the k-th knowledge point in the i-th dataset;
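A minimal sketch of the budget allocation reconstructed in formulas (III) and (IV): each task receives an equal share B/t, which is then split across knowledge points in proportion to their sample counts. Function and variable names are illustrative, not from the patent text.

```python
# Sketch of formulas (III)-(IV): per-task share B/t, then proportional
# per-knowledge-point quotas; rounding may drift quotas by +/-1.
from collections import Counter

def allocate_quotas(total_budget: int, task_index: int,
                    knowledge_labels: list[str]) -> dict[str, int]:
    """Split the per-task share of the budget across knowledge points
    in proportion to their sample counts in the original dataset."""
    per_task = total_budget // task_index            # m_i = B / t  (formula III)
    counts = Counter(knowledge_labels)               # |D_{i,k}| per knowledge point
    n = len(knowledge_labels)                        # |D_i|
    return {k: max(1, round(per_task * c / n)) for k, c in counts.items()}

# Example: a budget of 600 over 3 tasks gives each historical task 200 slots.
quotas = allocate_quotas(600, 3, ["k1"] * 70 + ["k2"] * 20 + ["k3"] * 10)
```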
(1-2) selecting samples:

the memory set is selected by measuring samples against the representativeness, difference, and balance indexes; when selecting, the samples are traversed using one of the following two schemes;

scheme (1): traverse in ascending order of the distance between the sample vector and the center vector, where the vectors include the surface feature vector and the implicit feature vector;

scheme (2): traverse each sample randomly with equal probability;

if a traversed sample satisfies representativeness, difference, and balance, it is added to the memory set;

otherwise, the traversed sample is discarded and the next traversal continues, until the number of selected samples meets the requirement;
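The traversal-and-check loop of step (1-2) can be sketched as below; the three predicate functions stand in for the representativeness, difference, and balance checks defined in the later sections, and all names are assumptions.

```python
# Sketch of the memory-set selection loop: traverse candidates by distance to
# the center vector (scheme 1) or in random order (scheme 2), keeping samples
# that pass all three index checks until the quota is reached.
import numpy as np

def select_memory_samples(candidates, center, quota,
                          is_representative, differs_from, keeps_balance,
                          scheme="distance"):
    vecs = np.stack([c["vector"] for c in candidates])
    if scheme == "distance":                       # scheme (1)
        order = np.argsort(np.linalg.norm(vecs - center, axis=1))
    else:                                          # scheme (2)
        order = np.random.permutation(len(candidates))
    memory = []
    for i in order:
        x = candidates[i]
        if is_representative(x) and differs_from(x, memory) and keeps_balance(x, memory):
            memory.append(x)                       # passes all three indexes
        if len(memory) >= quota:
            break
    return memory
```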
(2) The current task sample selection comprises the following steps:

sample representativeness analysis, sample difficulty analysis, and sample sampling; the flow is as follows;

to reduce training cost, only a very small number of samples should be selected for training on the current task, but a small number of samples can hardly represent the characteristics of the overall sample population; representative and difficult samples therefore need to be extracted from the current task dataset for training the text knowledge reasoning model. Since the screened samples are used to fine-tune the model, they must, on the one hand, represent the overall samples as well as possible under the limited-sample constraint and, on the other hand, include difficult samples, which carry more information from which the model can benefit; the invention therefore combines representativeness and difficulty to screen samples for the current task;
(2-1) sample representativeness analysis in the current task sample selection, comprising:

for the candidate sample set D_{t,k} of knowledge point k in the dataset D_t of the current (t-th) task, clustering is performed on the set of hypothesis-text surface feature vectors {f_sur(h)} of the samples, with the number of clusters specified as n; n can be determined from the number of knowledge points of the previous task, i.e., a similar number is chosen; this yields n clusters u_1, …, u_n. Let q_k be the number of samples to extract for knowledge point k in this task. For each cluster u_j, the in-cluster sample variance var(u_j) and the in-cluster sample count |u_j| are calculated to analyze sample representativeness: in a cluster with large variance and a large sample count, each individual sample is less representative of the cluster, so more samples must be drawn from that cluster to maintain its representativeness. The number q_{k,j} of samples drawn from cluster u_j is determined according to formula (V), where q_{k,j} means the number of samples selected from the j-th cluster for knowledge point k in the t-th task:

q_{k,j} = q_k · var(u_j) · |u_j| / Σ_{j'} var(u_{j'}) · |u_{j'}|    (V)
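A sketch of the cluster-level quota allocation of step (2-1), assuming k-means over the hypothesis-text surface vectors and the variance-times-size weighting reconstructed as formula (V); the helper name is illustrative.

```python
# Sketch of step (2-1): cluster surface vectors, then allocate the
# knowledge-point quota q_k across clusters by variance * size.
import numpy as np
from sklearn.cluster import KMeans

def cluster_quotas(surface_vectors: np.ndarray, n_clusters: int, q_k: int):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(surface_vectors)
    weights = []
    for j in range(n_clusters):
        members = surface_vectors[km.labels_ == j]
        var_j = members.var(axis=0).sum()          # in-cluster sample variance
        weights.append(var_j * len(members))       # large variance & size -> more draws
    weights = np.asarray(weights) / sum(weights)
    quotas = np.round(weights * q_k).astype(int)   # q_{k,j}; rounding may drift by 1
    return km.labels_, quotas
```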
(2-2) sample difficulty analysis in the current task sample selection, comprising:

using a pre-trained professional text reasoning model F, the premise text p and the hypothesis text h of a sample x are input for inference prediction, obtaining the predictive probability distribution P(c | p, h) over the category set C = {0, 1, 2};

let c* denote the category label of maximum probability and c' the category label of second-largest probability; formula (VII) calculates, for sample x, the difference between the largest and second-largest output probabilities in the inference model's predicted probability distribution, which measures the difficulty d(x) of sample x:

d(x) = P(c* | p, h) − P(c' | p, h)    (VII)

the smaller d(x) is, the less confident the inference model is in its prediction, and therefore the more difficult the sample; here P(c* | p, h) denotes the probability with which the professional text reasoning model F predicts sample category c*, and P(c' | p, h) denotes the probability with which it predicts sample category c';
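A sketch of the difficulty score of formula (VII), the margin between the model's top two class probabilities; the logits interface is an assumption, and any three-way inference classifier returning raw scores would do.

```python
# Sketch of formula (VII): d(x) = P(c*) - P(c'); smaller margin = harder sample.
import torch

def difficulty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, 3) raw scores for entailment/contradiction/neutral."""
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values
    return top2[:, 0] - top2[:, 1]          # top-1 minus top-2 probability
```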
(2-3) sample sampling in the current task sample selection, comprising:

(2-3-1) performing the sample representativeness analysis described in step (2-1) on the candidate sample set D_{t,k} of knowledge point k in the current task dataset D_t, obtaining the number q_{k,j} of samples to draw from each cluster u_j, so as to maintain the representativeness of the screened sample set;

(2-3-2) after the sample difficulty analysis described in step (2-2), the quantified difficulty value d(x) of each sample x is calculated; for each cluster u_j, the q_{k,j} most difficult samples are sampled from it, i.e., the q_{k,j} samples with the smallest d(x) values are chosen in ascending order and added to the small sample set; carrying out the above sampling for all clusters u_1, …, u_n completes the sampling for knowledge point k; all knowledge points in the current task complete the data screening on the current task through this sampling process, and the number of training samples finally screened for the current task is the sum of the per-knowledge-point quotas, Σ_k q_k.
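Tying steps (2-1) to (2-3) together for one knowledge point, reusing the illustrative cluster_quotas and difficulty helpers sketched above:

```python
# Sketch of step (2-3): cluster, score difficulty, take the hardest per cluster.
import numpy as np

def sample_current_task(surface_vectors, logits, n_clusters, q_k):
    """Return indices of the screened training samples for one knowledge point."""
    labels, quotas = cluster_quotas(surface_vectors, n_clusters, q_k)
    d = difficulty(logits).detach().cpu().numpy()   # smaller d(x) = harder
    chosen = []
    for j in range(n_clusters):
        members = np.where(labels == j)[0]
        hardest = members[np.argsort(d[members])][: quotas[j]]  # ascending d(x)
        chosen.extend(hardest.tolist())
    return chosen
```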
According to the invention, in step (1), the specific method for obtaining the surface features of a text is as follows:

the text is segmented using the coarse-grained tokenizer of HanLP; common Chinese stop words are removed from the segmentation result, and words that appear only once are filtered out, yielding a dictionary over all texts; finally, the TF-IDF feature vector of each text is calculated as the surface features of the sample.
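A minimal sketch of this surface-feature pipeline using sklearn's TfidfVectorizer; the tokenizer below is a stand-in for HanLP's coarse-grained tokenizer, and the stop-word list is illustrative.

```python
# Sketch of the surface-feature pipeline: tokenize, drop stop words and
# words appearing only once (min_df=2), then compute TF-IDF vectors f_sur(x).
from sklearn.feature_extraction.text import TfidfVectorizer

def coarse_tokenize(text: str) -> list[str]:
    # Placeholder for a loaded HanLP coarse-grained tokenizer; whitespace
    # splitting keeps the sketch self-contained.
    return text.split()

stop_words = ["的", "了", "是"]                    # illustrative Chinese stop words
vectorizer = TfidfVectorizer(tokenizer=coarse_tokenize, min_df=2,
                             stop_words=stop_words, token_pattern=None)
# surface = vectorizer.fit_transform(corpus)       # f_sur(x) for each text x
```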
According to the invention, in step (1), the specific method for obtaining the implicit features of a text is preferably as follows:

a concrete implementation of the Sentence-BERT model, Sentence-Transformers, is used; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the encoded Sentence-BERT feature vector represents the implicit features of the text.
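A sketch of the implicit-feature encoding with Sentence-Transformers, loading the checkpoint named above; the example text is illustrative.

```python
# Sketch of the implicit-feature step: encode texts into Sentence-BERT vectors.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
texts = ["有限责任公司的股东资格可以继承。"]        # illustrative hypothesis text
implicit = encoder.encode(texts)                  # f_imp(x): one vector per text
```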
Preferably, in step (1-2), the method for selecting the memory set by measuring samples against the representativeness, difference, and balance indexes comprises:

for a series of tasks T_1, …, T_t, each with a corresponding dataset, where T_t denotes the task currently to be learned and D_t its corresponding dataset, each sample in a dataset being a triple (p, h, y): when the text knowledge reasoning model learns on the current task, memory playback is performed with the memory set M = ∪_i M_i, where ∪ is the set union operation whose scope is all tasks before the current task, i.e., i = 1, …, t−1; M_i denotes the set of samples from dataset D_i added to the memory set, and M is the memory set. Each set M_i satisfies representativeness, difference, and balance;

the representativeness, measured using the Local Outlier Factor (LOF): because individuals differ in professional knowledge level and expression, samples describing the same knowledge point are uneven in quality and varied in descriptive form, so the samples added to the memory set should represent the different qualities and forms of all samples describing the corresponding knowledge point;

the difference: for samples x_i, x_j, the difference between them is measured by the L2 distance of their surface feature vectors and implicit feature vectors; to achieve the model robustness goal, diverse samples are selected into the memory set, i.e., there should be differences between sample features;

the balance means that all knowledge points described in the original dataset are covered in the memory set, and the number of samples describing the same knowledge point in the original dataset is balanced with the number describing it in the memory set.
According to a preferred embodiment of the invention, the specific method for measuring the representativeness using the Local Outlier Factor (LOF) is as follows:

formula (VIII) is used to calculate the distance between two vectors:

d(v_i, v_j) = ‖v_i − v_j‖_2    (VIII)

where v_i, v_j are two surface feature vectors or two implicit feature vectors;

N_r(v) denotes the set of all vectors within the r-distance of vector v, where the r-distance of vector v means the distance from v to its r-th nearest neighboring vector; from these neighborhoods, the local reachability density and the LOF value of each vector are computed;
when the LOF value of a vector is greater than 1, the local reachability density of the current vector is smaller than the local reachability density of surrounding vectors, and the larger the value, the greater the outlier degree of the current sample;

when the LOF value of a vector is less than or equal to 1, the local reachability density of the current vector is larger than the local reachability density of surrounding vectors, and the smaller the value, the greater the aggregation degree of the current sample;
the representativeness Rep(x) of a sample is obtained by combining the local outlier condition of the sample in the surface feature space and the implicit feature space, as in formula (XII); the smaller the LOF values, the larger Rep(x), and the more representative the sample: from the above description of LOF values less than or equal to 1 and greater than 1, the larger the LOF value, the greater the outlier degree of the corresponding vector and the more likely the vector is abnormal; the smaller the LOF value, the smaller the outlier degree of the corresponding vector and the more likely the vector is normal;

Rep(x) = α / LOF(f_sur(x)) + (1 − α) / LOF(f_imp(x))    (XII)

where α is an adjustable parameter indicating the relative importance of the surface features: if the surface features are important, α is increased; if the implicit features are important, α is decreased; the default value is 0.5. The division in formula (XII) keeps Rep(x) within a similar range regardless of the sample distribution.
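A sketch of the representativeness score, computing LOF in both feature spaces with scikit-learn; the α/LOF mixing form follows the reconstruction of formula (XII) above and is an assumption.

```python
# Sketch of Rep(x): smaller LOF in either feature space raises the score.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def representativeness(surface: np.ndarray, implicit: np.ndarray,
                       alpha: float = 0.5, r: int = 20) -> np.ndarray:
    lof_sur = LocalOutlierFactor(n_neighbors=r).fit(surface)
    lof_imp = LocalOutlierFactor(n_neighbors=r).fit(implicit)
    # negative_outlier_factor_ stores -LOF, so negate to recover LOF values
    s = -lof_sur.negative_outlier_factor_
    i = -lof_imp.negative_outlier_factor_
    return alpha / s + (1.0 - alpha) / i           # formula (XII), reconstructed
```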
According to a preferred embodiment of the invention, the difference between samples x_i, x_j is measured using the L2 distance of their surface feature vectors and implicit feature vectors; the specific method is as follows:

as shown in formulas (XIII) and (XIV), where ε_sur and ε_imp are difference thresholds, adjustable parameters that default to the average of the distances between all samples:

‖f_sur(x_i) − f_sur(x_j)‖_2 > ε_sur    (XIII)

‖f_imp(x_i) − f_imp(x_j)‖_2 > ε_imp    (XIV)

a candidate sample is selected into the memory set when it satisfies formulas (XIII) and (XIV) against the samples already in the memory set; this is the step that determines whether there is a difference between the candidate sample and the samples in the memory set.
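A sketch of the difference check of formulas (XIII) and (XIV): a candidate must exceed both thresholds against every sample already in the memory set; all names are illustrative.

```python
# Sketch of the difference check in both feature spaces.
import numpy as np

def differs_from(cand_sur, cand_imp, mem_sur, mem_imp,
                 eps_sur: float, eps_imp: float) -> bool:
    if len(mem_sur) == 0:
        return True                                  # empty memory set: accept
    d_sur = np.linalg.norm(np.asarray(mem_sur) - cand_sur, axis=1)
    d_imp = np.linalg.norm(np.asarray(mem_imp) - cand_imp, axis=1)
    return bool((d_sur > eps_sur).all() and (d_imp > eps_imp).all())
```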
According to a preferred embodiment of the invention, the balance requires that the feature distribution of the sample set M_i selected to be added to the memory set approximates the feature distribution of the original dataset D_i:

P(x ∈ D_{i,k} | x ∈ M_i; θ_M) ≈ P(x ∈ D_{i,k} | x ∈ D_i; θ_D)  for every knowledge point k ∈ K

where P(·; θ_M) is the memory set sample probability distribution, P(·; θ_D) is the original dataset sample probability distribution, and θ_D, θ_M are, respectively, the parameters of the original dataset sample probability distribution and the memory set sample probability distribution (a probability distribution P is accompanied by the specific parameter θ that determines it, hence the notation P(·; θ)); ∈ is the basic membership operation of set operations; x denotes a sample; D_{i,k} denotes the samples describing knowledge point k; K is the set of knowledge points. After a sample is added, whether the formula holds is verified; if it holds, the definition of balance is satisfied.
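A sketch of a balance check consistent with the definition above: after tentatively adding a sample, per-knowledge-point proportions in the memory set are compared with those of the original dataset; the tolerance is an illustrative assumption.

```python
# Sketch of the balance check: memory-set label proportions should track the
# original dataset's proportions within a tolerance.
from collections import Counter

def keeps_balance(mem_labels: list[str], orig_labels: list[str],
                  tol: float = 0.1) -> bool:
    mem, orig = Counter(mem_labels), Counter(orig_labels)
    n_mem, n_orig = len(mem_labels), len(orig_labels)
    for k, c in orig.items():
        p_orig = c / n_orig
        p_mem = mem.get(k, 0) / max(n_mem, 1)
        if abs(p_mem - p_orig) > tol:
            return False
    return True
```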
In the above process, the hyperparameters (such as α, ε_sur, and ε_imp) can be used with their default values or customized to meet actual business requirements. The invention can automatically select historical samples to add to the memory set according to the established algorithm, for subsequent training of the natural language reasoning model oriented to professional knowledge points.
An apparatus for implementing the sample selection method, comprising: a center vector calculating module, a sample selection module, and a training module;

the center vector calculating module is used for calculating the center vectors of the surface features and the implicit features according to the label information of the samples, for use in subsequent sample selection;

the sample selection module selects suitable samples to add to the memory set according to the characteristics the samples must satisfy, using a selectable sample selection strategy, and finally obtains the complete memory set;

and the training module is used for assisting the current task training with the complete memory set.
A computer device for implementing the sample selection method, characterized by comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following:

in the use stage, the user specifies the tasks, the precedence relationship among tasks, the strategy, and optional hyperparameters; historical samples are then selected by the sample selection method and added to the memory set, which is used for subsequent professional text reasoning model training, i.e., training the natural language reasoning model for professional knowledge points.
The invention has the technical advantages that:
compared with the prior art that the representative sample is selected based on the clustering center, the continuous learning sample selection method for the text knowledge reasoning model can be better adapted to complex text reasoning scenes, and a small number of samples are effectively used for approximating the distribution of the original samples, so that the model memorizes the learned knowledge on the historical task.
According to the invention, the text knowledge reasoning model can be finely tuned and optimized according to the properties and the screened samples of the high-quality memory set sample provided by the practical problem, the model can be better helped to memorize the existing task, and meanwhile, the model is helped to train the current task, so that the robustness of the model in practical use is effectively increased.
In addition, the invention can also be used on other similar tasks in the field of natural language processing, such as continuous learning tasks based on scene memory playback, such as knowledge questions and answers, text classification and the like.
Drawings
FIG. 1 is a schematic diagram of task versus dataset for text-oriented knowledge reasoning in accordance with the present invention;
FIG. 2 is a flow chart of a method for selecting a continuous learning history sample based on text knowledge reasoning;
FIG. 3 is a schematic diagram of a sample selection flow of a text knowledge reasoning-oriented continuous learning history sample selection method in accordance with the present invention;
fig. 4 is a flow chart of a method for continuously learning a current sample selection, which is directed to text knowledge reasoning in the present invention.
Detailed Description
The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.
EXAMPLE 1,

As shown in FIG. 1 and FIG. 2, a sample selection method for continuous learning of a text knowledge reasoning model comprises historical task sample selection and current task sample selection, carried out as described in the detailed technical scheme above.

(1) The historical task sample selection comprises obtaining the center vectors and selecting samples: the center vector of all samples describing the same knowledge point is obtained by formulas (I) and (II), and suitable samples are selected and added to the memory sample set M, as shown in FIG. 2. Sample selection comprises step (1-1), determining the number of samples added to the memory set by formulas (III) and (IV), and step (1-2), selecting samples by measuring them against the representativeness, difference, and balance indexes; the flow is shown in FIG. 3.

(2) The current task sample selection comprises sample representativeness analysis (step (2-1)), sample difficulty analysis (step (2-2)), and sample sampling (step (2-3)), performed as described above; the flow is shown in FIG. 4.
EXAMPLE 2,

The sample selection method of Embodiment 1, wherein in step (1) the surface features of a text are obtained by the HanLP coarse-grained tokenization and TF-IDF procedure described above, and the implicit features are obtained by encoding the text with Sentence-Transformers, loading the pre-trained model paraphrase-multilingual-mpnet-base-v2.
EXAMPLE 3,

The sample selection method of Embodiments 1 and 2, wherein in step (1-2) the memory set is selected by measuring samples against the representativeness, difference, and balance indexes exactly as in the detailed technical scheme above: representativeness is measured with the Local Outlier Factor per formulas (VIII)-(XII); difference is measured with the L2 distances of the surface and implicit feature vectors against the thresholds of formulas (XIII) and (XIV); and balance requires the per-knowledge-point distribution of the memory set to approximate that of the original dataset. The hyperparameters may take their default values or be customized to meet actual business requirements.
EXAMPLE 5,

An apparatus for implementing the sample selection method of Embodiments 1-4, comprising the center vector calculating module, the sample selection module, and the training module described above.
EXAMPLE 6,

A computer apparatus implementing the sample selection method of Embodiments 1-4, comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the use stage described above: the user specifies the tasks, their precedence relationship, the strategy, and optional hyperparameters; historical samples are then selected, added to the memory set, and used for subsequent professional text reasoning model training.
Claims (8)
1. A sample selection method for continuous learning of a text knowledge reasoning model is characterized by comprising the following steps: historical task sample selection and current task sample selection;
(1) The historical task sample selection comprises the following steps: obtaining center vectors and selecting samples;

when the center vectors are obtained: for the dataset D_k of samples of the form (p, h, y) describing knowledge point k, the surface feature center vector c_k^sur and the implicit feature center vector c_k^imp are calculated using formulas (I) and (II), where f_sur(·) and f_imp(·) are, respectively, the functions that obtain the surface features and the implicit features of a text:

c_k^sur = (1/|D_k|) Σ_{x ∈ D_k} f_sur(x)    (I)

c_k^imp = (1/|D_k|) Σ_{x ∈ D_k} f_imp(x)    (II)

in formulas (I) and (II), one surface feature center vector and one implicit feature center vector are obtained for each knowledge point;

obtaining the surface features and implicit features of a text: the surface features are expressed using term frequency and inverse document frequency and are denoted f_sur(x); the implicit features are expressed using the Sentence-BERT vector of the encoded text and are denoted f_imp(x);
the sample selecting process comprises:

(1-1) determining the number of samples selected to be added to the memory set;

(1-2) selecting samples, namely selecting the memory set by measuring samples against the representativeness, difference, and balance indexes, and traversing the samples with one of the following two schemes when selecting,

scheme (1): traversing in ascending order of the distance between the sample vector and the center vector, the vectors including the surface feature vector and the implicit feature vector;

scheme (2): traversing each sample randomly with equal probability;

if the traversed sample satisfies representativeness, difference, and balance, it is added to the memory set;

otherwise, the traversed sample is discarded and the next traversal continues, until the number of selected samples meets the requirement;
(2) The current task sample selection comprises the following steps: sample representativeness analysis, sample difficulty analysis, and sample sampling;

(2-1) sample representativeness analysis in the current task sample selection;

(2-2) sample difficulty analysis in the current task sample selection;

(2-3) sample sampling in the current task sample selection;
in step (1):

(1-1) determining the number of samples selected to be added to the memory set:

ordering the tasks, the current task is denoted the t-th task, and the sample size selected from the dataset of each of the t−1 historical tasks is B/t, as in formula (III), where B is the total sample size to be selected for model training, namely the sum of the memory set sample size and the sample size selected for the current task:

m_i = B / t,  i = 1, …, t−1    (III)

the number q_k of samples selected from the samples of the k-th knowledge point in the dataset of the i-th task is determined by formula (IV), such that the distribution of the number of samples extracted per knowledge point is consistent with the original dataset:

q_k = m_i · |D_{i,k}| / |D_i|    (IV)

where D_i denotes the i-th dataset and D_{i,k} denotes the dataset of samples describing the k-th knowledge point in the i-th dataset;
in step (2), the sample representativeness analysis (2-1) in the current task sample selection comprises:

for the candidate sample set D_{t,k} of knowledge point k in the current task dataset D_t, clustering is performed on the set of hypothesis-text surface feature vectors of the samples, with the number of clusters specified as n; according to formula (VI), n clusters u_1, …, u_n are obtained; the number of samples to be extracted for knowledge point k in this task is q_k; for each cluster u_j, the in-cluster sample variance var(u_j) and the in-cluster sample count |u_j| are calculated to analyze sample representativeness: the number q_{k,j} of samples drawn from cluster u_j is determined according to formula (V), q_{k,j} meaning the number of samples selected from the j-th cluster for knowledge point k in the t-th task:

q_{k,j} = q_k · var(u_j) · |u_j| / Σ_{j'} var(u_{j'}) · |u_{j'}|    (V)
(2-2) the sample difficulty analysis in the current task sample selection comprises:

using a pre-trained professional text reasoning model F, the premise text p and the hypothesis text h of a sample x are input for inference prediction, obtaining the predictive probability distribution over the category set C;

with c* denoting the category label of maximum probability and c' the category label of second-largest probability, formula (VII) calculates, for sample x, the difference between the largest and second-largest output probabilities in the inference model's predicted probability distribution, which measures the difficulty d(x) of sample x:

d(x) = P(c* | p, h) − P(c' | p, h)    (VII)

where P(c* | p, h) denotes the probability with which the professional text reasoning model F predicts sample category c*, and P(c' | p, h) denotes the probability with which it predicts sample category c';
(2-3) the sample sampling in the current task sample selection comprises:

(2-3-1) performing the sample representativeness analysis described in step (2-1) on the candidate sample set D_{t,k} of knowledge point k in the current task dataset D_t, obtaining the number q_{k,j} of samples to draw from each cluster u_j, so as to maintain the representativeness of the screened sample set;

(2-3-2) after the sample difficulty analysis described in step (2-2), the quantified difficulty value d(x) of each sample x is calculated; for each cluster u_j, the q_{k,j} most difficult samples are sampled from it, i.e., the q_{k,j} samples with the smallest d(x) values are chosen in ascending order and added to the small sample set; carrying out the above sampling for all clusters u_1, …, u_n completes the sampling for knowledge point k; all knowledge points in the current task complete the data screening on the current task through this sampling process, and the number of training samples finally screened for the current task is the sum of the per-knowledge-point quotas, Σ_k q_k;
in step (1-2), the method for selecting the memory set by measuring samples against the representativeness, difference, and balance indexes comprises:

the representativeness is measured using the local outlier factor;

the difference, for samples x_i, x_j, is measured by the L2 distance between their surface feature vectors and between their implicit feature vectors;

the balance means that all knowledge points described in the original dataset are covered in the memory set, and the number of samples describing the same knowledge point in the original dataset is balanced with the number describing it in the memory set.
2. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein in step (1) the specific method for obtaining the surface features of a text is as follows:

the text is segmented using the coarse-grained tokenizer of HanLP; common Chinese stop words are removed from the segmentation result, and words that appear only once are filtered out, yielding a dictionary over all texts; finally, the TF-IDF feature vector of each text is calculated as the surface features of the sample.
3. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein in step (1) the specific method for obtaining the implicit features of a text is as follows:

a concrete implementation of the Sentence-BERT model, Sentence-Transformers, is used; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the encoded Sentence-BERT feature vector represents the implicit features of the text.
4. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein the specific method for measuring the representativeness using the local outlier factor is:

formula (VIII) is used to calculate the distance between two vectors:

d(v_i, v_j) = ‖v_i − v_j‖_2    (VIII)

where v_i, v_j are two surface feature vectors or two implicit feature vectors;

N_r(v) denotes the set of all vectors within the r-distance of vector v, where the r-distance of vector v means the distance from v to its r-th nearest neighboring vector;

when the LOF value of a vector is greater than 1, the local reachability density of the current vector is smaller than the local reachability density of surrounding vectors, and the larger the value, the greater the outlier degree of the current sample;

when the LOF value of a vector is less than or equal to 1, the local reachability density of the current vector is larger than the local reachability density of surrounding vectors, and the smaller the value, the greater the aggregation degree of the current sample;

the representativeness Rep(x) of the sample is obtained by combining the local outlier condition of the sample in the surface feature space and the implicit feature space; the smaller the LOF values, the larger Rep(x);
5. The sample selection method for continuous learning of a text knowledge reasoning model according to claim 1, wherein the difference between samples x_i, x_j is measured using the L2 distance of their surface feature vectors and implicit feature vectors; the specific method is as follows:

‖f_sur(x_i) − f_sur(x_j)‖_2 > ε_sur    (XIII)

‖f_imp(x_i) − f_imp(x_j)‖_2 > ε_imp    (XIV)

a candidate sample that satisfies formulas (XIII) and (XIV) is entered into the selected memory set.
6. The sample selection method for continuous learning of a text-oriented knowledge reasoning model according to claim 1, wherein the balance requires that the feature distribution of the sample set $\mathcal{M}$ added to the memory set approximates the feature distribution of the original data set $\mathcal{D}$:

$P(\mathcal{M}; \theta_{\mathcal{M}}) \approx P(\mathcal{D}; \theta_{\mathcal{D}})$

where $P(\mathcal{M}; \theta_{\mathcal{M}})$ is the sample probability distribution of the memory set, abbreviated $P(\mathcal{M})$ and treated as a whole; $P(\mathcal{D}; \theta_{\mathcal{D}})$ is the sample probability distribution of the original data set; $\theta_{\mathcal{D}}$ and $\theta_{\mathcal{M}}$ are the parameters of the original-data-set and memory-set sample probability distributions respectively, every probability distribution $P(\cdot)$ being accompanied by a parameter $\theta$ that determines its specific form and therefore written $P(\cdot; \theta)$; $\cup$ is the basic operation of set operations; $x$ denotes a sample; $x_c$ denotes a sample describing knowledge point $c$; and $\mathcal{C}$ is the set of knowledge points.
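One way to satisfy this constraint, sketched under the assumption that matching the per-knowledge-point sample proportions of the original data set keeps the two distributions approximate; the proportional-quota rule is illustrative, not the patent's stated procedure:

```python
from collections import Counter

def balanced_quota(knowledge_points: list[str], budget: int) -> dict[str, int]:
    """Allocate the memory-set budget across knowledge points in proportion
    to their share of the original data set (rounding may shift the total
    by a sample or two, which a real implementation would reconcile)."""
    counts = Counter(knowledge_points)  # samples per knowledge point in D
    total = sum(counts.values())
    return {kp: max(1, round(budget * n / total)) for kp, n in counts.items()}

# Toy usage: a 10-sample memory set over three knowledge points.
print(balanced_quota(["a"] * 50 + ["b"] * 30 + ["c"] * 20, budget=10))
```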
7. An apparatus for implementing the sample selection method of any one of claims 1-6, comprising a center vector calculation module, a sample selection module and a training module;
the center vector calculation module is used for calculating the center vectors of the surface features and the implicit features according to the label information of the samples;
the sample selection module selects suitable samples and adds them to the memory set, finally obtaining a complete memory set;
and the training module uses the complete memory set to assist the training of the current task.
8. A computer device implementing the sample selection method of any one of claims 1-6, characterized in that it comprises a processor, a storage device, and a computer program stored on the storage device and executable on the processor, and that when executing the computer program the processor implements the following:
in the use stage, the user specifies the tasks, the precedence relationship among the tasks, the strategy and optional hyper-parameters; historical samples are then selected by the sample selection method and added to the memory set, and the memory set is used for the subsequent training of the professional text reasoning model, i.e. training the natural language reasoning model on professional knowledge points.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310107542.8A CN115829036B (en) | 2023-02-14 | 2023-02-14 | Sample selection method and device for text knowledge reasoning model continuous learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115829036A CN115829036A (en) | 2023-03-21 |
CN115829036B (en) | 2023-05-05
Family
ID=85521149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310107542.8A Active CN115829036B (en) | 2023-02-14 | 2023-02-14 | Sample selection method and device for text knowledge reasoning model continuous learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115829036B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021022572A1 (en) * | 2019-08-07 | 2021-02-11 | 南京智谷人工智能研究院有限公司 | Active sampling method based on meta-learning |
CN112966115A (en) * | 2021-05-18 | 2021-06-15 | 东南大学 | Active learning event extraction method based on memory loss prediction and delay training |
WO2021184311A1 (en) * | 2020-03-19 | 2021-09-23 | 中山大学 | Method and apparatus for automatically generating inference questions and answers |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130097103A1 (en) * | 2011-10-14 | 2013-04-18 | International Business Machines Corporation | Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set |
US9477654B2 (en) * | 2014-04-01 | 2016-10-25 | Microsoft Corporation | Convolutional latent semantic models and their applications |
CN112347268B (en) * | 2020-11-06 | 2024-03-19 | 华中科技大学 | Text-enhanced knowledge-graph combined representation learning method and device |
CN115563315A (en) * | 2022-12-01 | 2023-01-03 | 东南大学 | Active complex relation extraction method for continuous few-sample learning |
CN115618045B (en) * | 2022-12-16 | 2023-03-14 | 华南理工大学 | Visual question answering method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | Interactive genetic algorithms with large population and semi-supervised learning | |
CN113128369B (en) | Lightweight network facial expression recognition method fusing balance loss | |
CN109993236A (en) | Few-shot Manchu language matching method based on one-shot Siamese convolutional neural networks | |
CN110826639B (en) | Zero sample image classification method trained by full data | |
CN109858015A (en) | Semantic similarity calculation method and device based on CTW and KM algorithms | |
Kamada et al. | An adaptive learning method of restricted Boltzmann machine by neuron generation and annihilation algorithm | |
Callan et al. | Self-organizing map for the classification of normal and disordered female voices | |
CN110349597A (en) | Speech detection method and device | |
CN116821698B (en) | Wheat scab spore detection method based on semi-supervised learning | |
CN106448660B (en) | Natural language fuzzy boundary determination method incorporating big data analysis | |
CN116805533A (en) | Cerebral hemorrhage operation risk prediction system based on data collection and simulation | |
Wang et al. | Soft focal loss: Evaluating sample quality for dense object detection | |
CN115270752A (en) | Template sentence evaluation method based on multilevel comparison learning | |
CN113330462A (en) | Neural network training using soft nearest neighbor loss | |
Liu et al. | Audio and video bimodal emotion recognition in social networks based on improved alexnet network and attention mechanism. | |
CN115829036B (en) | Sample selection method and device for text knowledge reasoning model continuous learning | |
Zhan | A convolutional network-based intelligent evaluation algorithm for the quality of spoken English pronunciation | |
Lee et al. | Learning non-homogenous textures and the unlearning problem with application to drusen detection in retinal images | |
Shukla et al. | A novel stochastic deep conviction network for emotion recognition in speech signal | |
Wang et al. | Intelligent radar HRRP target recognition based on CNN-BERT model | |
Gumelar et al. | Transformer-CNN automatic hyperparameter tuning for speech emotion recognition | |
Perez et al. | Face Patches Designed through Neuroevolution for Face Recognition with Large Pose Variation | |
CN111523649A (en) | Method and device for preprocessing data aiming at business model | |
Zhou et al. | Research on intelligent diagnosis algorithm of diseases based on machine learning | |
CN116824237A (en) | Image recognition and classification method based on two-stage active learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||