CN115829036A - Sample selection method and device for continuous learning of text knowledge inference model - Google Patents

Sample selection method and device for continuous learning of text knowledge inference model

Info

Publication number
CN115829036A
CN115829036A (application CN202310107542.8A)
Authority
CN
China
Prior art keywords
sample
samples
vector
text
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310107542.8A
Other languages
Chinese (zh)
Other versions
CN115829036B (en)
Inventor
孙宇清
杨磊稳
马磊
杨涛
袁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310107542.8A priority Critical patent/CN115829036B/en
Publication of CN115829036A publication Critical patent/CN115829036A/en
Application granted granted Critical
Publication of CN115829036B publication Critical patent/CN115829036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A sample selection method and device for continuous learning of a text knowledge inference model belong to the technical field of natural language inference and comprise historical task sample selection and current task sample selection. The historical task sample selection comprises: determining the number of samples to be added to the memory set; and selecting samples by measuring them with representativeness, difference and balance indexes, traversing the samples with one of two schemes during selection. The current task sample selection comprises: sample representativeness analysis, sample difficulty analysis and sample sampling. The method takes into account sample properties such as representativeness, balance and difference; compared with prior-art methods that select representative samples based on cluster centres, it adapts better to complex text reasoning scenarios, effectively uses a small number of samples to approximate the distribution of the original samples, and enables the model to retain the knowledge learned on historical tasks.

Description

Sample selection method and device for continuous learning of text knowledge inference model
Technical Field
The invention relates to a sample selection method and a sample selection device for continuous learning of a text knowledge inference model, and belongs to the technical field of natural language inference.
Background
The natural language inference task consists of giving a premise text and a hypothesis text and, taking the premise text as the standard, judging whether the hypothesis text is correct, incorrect or unrelated. Text knowledge reasoning is a special form of the natural language inference task, in which the premise text is a knowledge point in a professional field or a description of facts related to that knowledge point, and the hypothesis text describes how different people understand the knowledge point in the premise text. For example, in an economic-law examination, the premise text is professional knowledge in the field of economic law or a description of related facts, such as the reference answer to an examination question: "According to the provisions of the company legal system, where the natural-person shareholders of a limited liability company change because of inheritance and the other shareholders claim to exercise a right of first refusal, the people's court does not support the claim, unless the articles of association of the company provide otherwise or all shareholders have agreed otherwise." The hypothesis text is the result of different people's understanding of the knowledge point and, in the above case, corresponds to an examinee's answer, such as: "A claim to exercise the right of first refusal is not supported by the people's court. Where a shareholder of a limited liability company dies, the equity is inherited by the successor." In this example, the text knowledge reasoning task is to judge whether the hypothesis text, i.e. the examinee's answer, is correct according to the premise text, i.e. the reference answer. Text knowledge reasoning has important application value in fields such as subjective-question grading, professional knowledge question answering and knowledge reasoning.
In the professional-knowledge text reasoning problem, the number of knowledge-point categories in a professional field is huge, the knowledge points are described in many different forms, the content and form of the premise texts are continuously updated, and the hypothesis texts, being closely tied to an individual's level of professional knowledge and expressive ability, are uneven in quality and diverse in form. These properties of the premise and hypothesis texts make samples describing the same knowledge point highly confusable and hard to identify, leave rarely used knowledge points with few corresponding samples, and leave unpopular professional knowledge points short of labelled samples. Facing continuously growing knowledge sample data, especially knowledge points not covered by the historical sample data, an intelligent model must not only cope with sample-level challenges such as few-shot data and noise, but also with the continual-learning challenge of learning new knowledge points without forgetting existing knowledge, so as to increase its generalisation ability and robustness.
Continual learning is introduced so that the text knowledge reasoning model can complete new problems well while still handling historical tasks with good performance. In the field of artificial intelligence, the memory replay strategy is the most effective continual-learning method; see, for example, Wang, Hong, et al. "Sentence Embedding Alignment for Lifelong Relation Extraction." arXiv preprint arXiv:1903.02588 (2019).
The goal of continual learning is achieved by storing part of the samples of previous tasks and letting them participate in the next round of training; the set formed by these samples is called the memory set, and the quality of the samples in the memory set determines the performance of the inference model on historical tasks.
For example, Chinese patent document CN114722892A provides a continual-learning method and device based on machine learning, which trains a generator on historical data and uses the generator to produce a pseudo-sample set of the corresponding task as the memory set; this method has difficulty guaranteeing the quality of the generated samples, which affects the continual-learning effect.
Chinese patent document CN113688882A proposes a training method and device for a memory-enhanced continual-learning neural network model which, inspired by memory replay in the human brain, uses a simple data-replay method to build an expandable memory module by storing the mean and variance of the data, achieving a memory-enhancement effect on the original task.
Chinese patent document CN113590958A discloses a continual-learning method for a sequence recommendation model based on sample replay, which samples a small portion of representative samples according to an item-class balancing policy to generate the memory set; this method does not consider the difficulty of and the differences between samples. In conclusion, existing work hardly meets the continual-learning requirements of a text knowledge reasoning model.
In summary, the problems of the prior art include: samples describing the same knowledge point are diverse in form and uneven in quality; samples are unbalanced in knowledge-point category coverage and in the number per knowledge-point category; and when selecting samples to add to the memory set, samples describing the same knowledge point are highly repetitive in form or quality.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a sample selection method for continuous learning of a text knowledge reasoning model.
The invention also discloses a device for implementing the sample selection method.
To address these problems, the invention proposes a representativeness index for the problem that samples describing the same knowledge point are diverse in form and uneven in quality; a balance index for the problem that samples are unbalanced in knowledge-point category coverage and in number per knowledge-point category; and a difference index to prevent samples describing the same knowledge point from being highly repetitive in form or quality when selecting samples to add to the memory set. On this basis, the invention further provides several selection strategies and sample-selection techniques that take both sample quality and sample feature distribution into account, improving the model performance and robustness of continual learning for professional-knowledge text reasoning, and having theoretical significance for other text understanding tasks.
Interpretation of professional terms
1. Professional knowledge: texts such as theories, techniques, concepts and fact descriptions in professional fields such as finance, law and accounting, as distinguished from general knowledge and common sense.
2. Professional knowledge point: the minimum unit of professional knowledge, described in a normalized text form, hereinafter referred to as a knowledge point and denoted k.
3. Premise (precondition) text: a professional knowledge point, or a description of facts related to a professional knowledge point; the same knowledge point may be described by several premise texts, denoted p.
4. Hypothesis text: a textual description of the result of a person's understanding of a knowledge point in the professional field; one premise text may have several corresponding hypothesis texts, denoted h.
5. Task: in the continual learning process of the model, learning proceeds over a series of tasks; the tasks have a temporal order, and the model learns on each task separately.
6. Data set: each task has its own data set; each sample x of a data set is a tuple of the form (p, h, y), where p is the premise text, h is the hypothesis text, and y ∈ {0, 1, 2}, with 0, 1 and 2 respectively indicating that the sample label is entailment, contradiction or neutral. The relationship of tasks to data sets is shown in FIG. 1.
7. The Sentence-BERT model: the model described in Reimers N., Gurevych I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084, 2019.
8. SentenceTransformers: a code implementation of the Sentence-BERT model based on the PyTorch framework for Python; there is currently no established Chinese translation of the name.
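For readers following the implementation, the terminology above maps onto a simple data structure. The sketch below is illustrative only: the class and field names (KnowledgeSample, premise, hypothesis, knowledge_point) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List

# Label convention from the definitions above: 0 = entailment, 1 = contradiction, 2 = neutral.
@dataclass
class KnowledgeSample:
    premise: str          # precondition text: a knowledge point or a related fact description
    hypothesis: str       # a person's understanding of that knowledge point
    label: int            # 0 / 1 / 2
    knowledge_point: str  # identifier of the professional knowledge point being described

# Each task in the continual-learning sequence carries its own data set.
TaskDataset = List[KnowledgeSample]
TaskSequence = List[TaskDataset]   # tasks are ordered in time; the last one is the current task
```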
The detailed technical scheme of the invention is as follows:
a sample selection method for continuous learning of a text knowledge inference model is characterized by comprising the following steps: selecting a historical task sample and selecting a current task sample;
(1) Wherein the selection of the historical task sample comprises the following steps:
obtaining central vectors and selecting samples, wherein the central vectors refer to the central vectors of all samples that describe the same knowledge point, and sample selection means choosing suitable samples to add to the memory sample set M;
when obtaining the central vectors:
for samples of the form (p, h, y), formulas (I) and (II) are used to calculate the surface-feature central vector c_sur^j and the implicit-feature central vector c_imp^j, where D_t^j denotes all samples in the data set D_t that describe the j-th knowledge point k_j, and f_sur(·) and f_imp(·) are the functions that extract the surface features and the implicit features of a text respectively (each central vector is the mean of the corresponding feature vectors):
c_sur^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_sur(x)   (I)
c_imp^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_imp(x)   (II)
in formulas (I) and (II), c_sur^j and c_imp^j denote the surface-feature central vector and the implicit-feature central vector of knowledge point k_j;
obtaining the surface features (explicit features) and implicit features (latent features) of a text: the surface features are expressed by term frequency and inverse document frequency (TF-IDF) and denoted f_sur(text); the implicit features are expressed by the Sentence-BERT vector and denoted f_imp(text), where text is the text to be encoded;
when selecting the samples, the process comprises:
(1-1) determining the number of samples to be added to the memory set:
formulas (III) and (IV) determine how many samples describing the same knowledge point are selected into the memory set; according to the task order, the current task is recorded as the t-th task, and the number of samples selected from the data set of the i-th historical task is m_i, given by formula (III), where N is the total number of samples to be selected for model training, i.e. the sum of the memory-set sample size and the number of samples selected from the current task;
formula (IV) determines the number m_{i,j} of samples selected from the samples related to the j-th knowledge point in the data set of the i-th task, so that the per-knowledge-point distribution of the extracted samples is consistent with that of the original data set:
m_{i,j} = m_i * |D_i^j| / |D_i|   (IV)
where D_i denotes the i-th data set and D_i^j denotes the subset of D_i that describes the j-th knowledge point;
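A minimal sketch of the quota computation described for formulas (III) and (IV). The exact expression of formula (III) is not reproduced in the source, so the even split over tasks is an assumption; the proportional split over knowledge points follows the stated goal of keeping the selected distribution consistent with the original data set. All function names are illustrative.

```python
from collections import Counter
from typing import Dict, List

def per_task_quota(total_budget: int, current_task_index: int) -> int:
    # Assumed reading of formula (III): split the total training budget N evenly over the
    # current task and the t-1 historical tasks (tasks indexed from 1 here).
    return total_budget // current_task_index

def per_knowledge_point_quota(knowledge_points: List[str], task_quota: int) -> Dict[str, int]:
    # Formula (IV): allocate a task's quota to its knowledge points in proportion to their
    # share of the task's data set, so the selected distribution tracks the original one.
    counts = Counter(knowledge_points)
    total = len(knowledge_points)
    return {k: max(1, round(task_quota * c / total)) for k, c in counts.items()}

# Example: 300-sample budget, current task is the 3rd task -> 100 samples per task,
# split over that task's knowledge points by frequency.
kps = ["kp_a"] * 50 + ["kp_b"] * 30 + ["kp_c"] * 20
print(per_task_quota(300, 3))                 # -> 100
print(per_knowledge_point_quota(kps, 100))    # -> {'kp_a': 50, 'kp_b': 30, 'kp_c': 20}
```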
(1-2) selecting samples:
the memory set is selected by measuring samples with the representativeness, difference and balance indexes; during selection, the samples are traversed with one of the following two schemes;
scheme (1): traverse in ascending order of the distance between the sample vector and the central vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly with equal probability;
if a traversed sample satisfies representativeness, difference and balance, it is added to the memory set;
otherwise the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
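The selection loop of step (1-2) can be sketched as follows for scheme (2) (random traversal with equal probability). The predicate functions is_representative, is_different and keeps_balance are hypothetical stand-ins for the representativeness, difference and balance tests defined later in the document.

```python
import random
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")

def select_memory_samples(
    candidates: Sequence[S],
    quota: int,
    is_representative: Callable[[S], bool],
    is_different: Callable[[S, List[S]], bool],
    keeps_balance: Callable[[S, List[S]], bool],
    seed: int = 0,
) -> List[S]:
    """Scheme (2): traverse candidates in random order and keep a sample only if it
    passes the representativeness, difference and balance tests; stop at the quota."""
    selected: List[S] = []
    order = list(candidates)
    random.Random(seed).shuffle(order)
    for sample in order:
        if len(selected) >= quota:
            break
        if (is_representative(sample)
                and is_different(sample, selected)
                and keeps_balance(sample, selected)):
            selected.append(sample)
    return selected
```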
(2) Wherein, the current task sample selection comprises:
sample representative analysis, sample difficulty analysis and sample sampling; the process is as follows;
in order to reduce the training cost, only a very small number of samples are expected to be selected for training on the current task; however, a small number of samples can hardly represent the characteristics of the whole sample population, so representative and difficult samples need to be extracted from the current-task data set to train the text knowledge inference model. Since the purpose of screening samples is to fine-tune the text knowledge inference model, on the one hand the representativeness of the whole sample population must be preserved as far as possible under the constraint of a limited sample budget, and on the other hand difficult samples should be selected because they carry more information that benefits the model; the method therefore combines the two indexes of representativeness and difficulty to screen samples on the current task;
(2-1) sample representativeness analysis in current-task sample selection, comprising:
for the candidate sample set of the j-th knowledge point in the data set D_t of the current task (the t-th task), clustering is performed on the set of hypothesis-text surface-feature vectors of the samples, with the number of clusters K specified in advance; K can be determined from the number of knowledge points of past tasks, i.e. a value similar to that number is chosen; according to formula (VI), K clusters C_1, ..., C_K are obtained; the number of samples to be extracted for the j-th knowledge point of the current task is m_{t,j}; for each cluster C_u, the variance of the samples in the cluster and the number of samples in the cluster are calculated to analyse sample representativeness: in a cluster with a large variance and a large number of samples, each individual sample is less representative of the cluster, so more samples need to be drawn from that cluster to maintain its representativeness; the number of samples drawn from cluster C_u is determined according to formula (V) and denoted m_{t,j,u}, i.e. the number of samples selected from the u-th cluster of the j-th knowledge point in the t-th task;
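A sketch of the representativeness analysis in (2-1), assuming k-means clustering on the hypothesis-text surface vectors (the clustering algorithm is not named explicitly in the source) and a size-times-variance weighting as one plausible reading of formula (V); both assumptions are marked in the comments.

```python
import numpy as np
from sklearn.cluster import KMeans

def allocate_per_cluster(surface_vectors: np.ndarray, n_clusters: int, quota: int, seed: int = 0):
    """Cluster the hypothesis-text surface vectors of one knowledge point and decide
    how many samples to draw from each cluster (assumed form of formula (V))."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(surface_vectors)
    labels = km.labels_
    weights = []
    for c in range(n_clusters):
        members = surface_vectors[labels == c]
        # Large, high-variance clusters are less well represented by any single sample,
        # so they receive a larger share of the quota (assumed weighting).
        variance = members.var(axis=0).sum() if len(members) > 1 else 0.0
        weights.append(len(members) * (variance + 1e-8))
    weights = np.asarray(weights)
    shares = weights / weights.sum()
    counts = np.maximum(1, np.round(shares * quota).astype(int))
    return labels, counts   # cluster label per sample, samples to draw per cluster
```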
(2-2) sample difficulty analysis in current-task sample selection, comprising:
the pre-trained professional text reasoning model f takes as input the premise text p and the hypothesis text h of a sample x and performs inference prediction, obtaining a predicted probability distribution over the class set {0, 1, 2}, where ŷ denotes the class label with the maximum probability; formula (VII) calculates, for sample x, the difference between the largest and the second-largest output probabilities in the predicted distribution, and uses this margin to measure the difficulty d(x) of the sample:
d(x) = P_f(ŷ | x) − max_{c ≠ ŷ} P_f(c | x)   (VII)
the smaller d(x) is, the less confident the reasoning model is in its prediction and the more difficult the sample is, where P_f(ŷ | x) denotes the probability with which the professional text reasoning model f predicts the sample class to be ŷ, and P_f(c | x) denotes the probability with which it predicts the sample class to be c;
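The difficulty measure of formula (VII) is simply the margin between the largest and the second-largest predicted class probabilities; a smaller margin means a less confident model and a harder sample. A minimal sketch:

```python
import numpy as np

def sample_difficulty(class_probs: np.ndarray) -> np.ndarray:
    """class_probs: (n_samples, n_classes) probability distribution predicted by the
    pre-trained inference model. Returns the margin of formula (VII): smaller = harder."""
    top2 = np.sort(class_probs, axis=1)[:, -2:]        # two largest probabilities per row
    return top2[:, 1] - top2[:, 0]                     # max minus second max

# Example: a confident prediction has a large margin, an ambiguous one a small margin.
probs = np.array([[0.90, 0.06, 0.04],
                  [0.40, 0.38, 0.22]])
print(sample_difficulty(probs))   # -> approximately [0.84, 0.02]; the second sample is harder
```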
(2-3) sample sampling in current-task sample selection, comprising:
(2-3-1) for the candidate sample set of each knowledge point in the current-task data set D_t, the sample representativeness analysis of step (2-1) is performed to obtain the number of samples m_{t,j,u} to draw from each cluster C_u, so as to maintain the representativeness of the screened sample set;
(2-3-2) the samples are then subjected to the sample difficulty analysis of step (2-2), and the difficulty value d(x) of each sample x is calculated; for each cluster C_u, the m_{t,j,u} most difficult samples are drawn from it, i.e. the samples with the smallest d(x) values are selected and added to the small sample set to be screened; every cluster goes through the above sampling process, which completes the sampling for knowledge point k_j; all knowledge points in the current task go through this process to finish the data screening on the current task, finally yielding the training sample set of the current task;
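Putting (2-1) and (2-2) together, the per-cluster sampling of step (2-3) can be sketched as below: within each cluster, keep the hardest samples (smallest margin) up to that cluster's quota. The function and argument names are illustrative.

```python
import numpy as np

def sample_current_task(margins: np.ndarray, cluster_labels: np.ndarray,
                        per_cluster_quota: np.ndarray) -> np.ndarray:
    """margins: difficulty values per sample (smaller = harder); cluster_labels: cluster id
    per sample; per_cluster_quota: samples to keep per cluster. Returns selected indices."""
    selected = []
    for c, quota in enumerate(per_cluster_quota):
        idx = np.where(cluster_labels == c)[0]
        if len(idx) == 0:
            continue
        hardest = idx[np.argsort(margins[idx])][:quota]   # ascending margin = hardest first
        selected.extend(hardest.tolist())
    return np.array(selected)
```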
According to the present invention, preferably, in step (1), the method for acquiring the surface features of a text specifically comprises:
segmenting the text with the coarse-grained tokenizer of HanLP, removing common Chinese stop words from the segmentation result, and filtering out words that appear only once to obtain the dictionary of all texts; finally, the TF-IDF feature vector of each text is calculated as the surface feature of the sample.
According to the present invention, preferably, in step (1), the specific method for acquiring the implicit features of a text comprises:
using SentenceTransformers, the concrete implementation of the Sentence-BERT model, loading the pre-trained model paraphrase-multilingual-mpnet-base-v2 to encode the text, and using the encoded Sentence-BERT feature vector to express the implicit features of the text.
Preferably, in step (1-2), the method for selecting the memory set by measuring samples with the representativeness, difference and balance indexes comprises:
for a series of tasks T_1, ..., T_t, each task has a corresponding data set, where T_t denotes the task currently to be learned and D_t its corresponding data set, and each sample in a data set is recorded as x = (p, h, y); when the text knowledge reasoning model learns on the current task, memory replay is performed with the memory set M = ∪_{i < t} M_i, where ∪ is the set-union operation whose range covers all tasks before the current one, M_i denotes the set of samples from data set D_i that are added to the memory set, and M is the abbreviation of the memory set; the set M must satisfy representativeness, difference and balance;
the representativeness, measured with the Local Outlier Factor (LOF): because individuals differ in professional knowledge level and expression, the samples describing the same knowledge point vary in quality and take diverse forms, so the samples added to the memory set should represent the different qualities and forms of all samples describing the corresponding knowledge point;
the difference: for samples x_i and x_j, the difference is measured with the distance between their surface and implicit feature vectors f_sur(·), f_imp(·); to achieve the robustness goal of the model, diverse samples, i.e. samples whose features differ from one another, are selected into the memory set;
the balance means that the memory set covers all knowledge points described in the original data set, and that the number of samples describing the same knowledge point in the memory set is balanced against the number of such samples in the original data set.
According to a preferred embodiment of the present invention, the specific method for measuring representativeness with the Local Outlier Factor (LOF) is:
formula (VIII) is used to calculate the distance d(a, b) between two vectors, where a and b are two surface feature vectors or two implicit feature vectors, referred to simply as vectors a and b;
formula (IX) is used to obtain the k-th reachable distance from vector a to vector b:
reach-dist_k(a, b) = max{ k-distance(b), d(a, b) }   (IX)
where k-distance(b) denotes, in the vector space, the distance from vector b to its k-th nearest vector;
formula (X) is used to calculate the local reachable density of a:
lrd_k(a) = 1 / ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} reach-dist_k(a, b) )   (X)
where N_k(a) is the set of all vectors within the k-th distance of vector a, and the k-th distance of vector a is the distance from a to its k-th nearest vector;
formula (XI) is used to calculate the local outlier factor of a:
LOF_k(a) = ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} lrd_k(b) ) / lrd_k(a)   (XI)
when LOF_k(a) is greater than 1, the local reachable density of the current vector is smaller than that of the surrounding vectors in the vector space, and the larger the value, the more the current sample is an outlier; when LOF_k(a) is less than or equal to 1, the local reachable density of the current vector is greater than that of the surrounding vectors, and the smaller the value, the more the current sample is aggregated;
the representativeness Rep(x) of a sample is obtained by combining the local outlier factors of the sample in the surface feature space and in the implicit feature space; the smaller the LOF values, the larger Rep(x) and the more representative the sample; LOF values lie in (0, +∞), and from the above discussion of values less than or equal to 1 and greater than 1, a larger LOF value means a greater degree of outlier and a more likely abnormal vector, while a smaller LOF value means a smaller degree of outlier and a more likely normal vector;
formula (XII) gives the calculation of Rep(x) for a sample x: it combines the two LOF values with an adjustable parameter α that indicates the relative importance of the surface features; if the surface features are more important, α is increased, and if the implicit features are more important, α is decreased, the default value being 0.5; the division in formula (XII) keeps Rep(x) within a similar range for different sample distributions.
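A sketch of the representativeness score, computing LOF with scikit-learn's LocalOutlierFactor in both feature spaces and combining them with the weight α. The reciprocal-based combination is an assumed form of formula (XII), chosen only so that smaller LOF values yield larger scores.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def representativeness(surface_vecs: np.ndarray, implicit_vecs: np.ndarray,
                       alpha: float = 0.5, n_neighbors: int = 20) -> np.ndarray:
    """Representativeness per formula (XII), assumed form: combine the local outlier
    factors computed in the surface and implicit feature spaces; smaller LOF -> larger score."""
    def lof(x: np.ndarray) -> np.ndarray:
        est = LocalOutlierFactor(n_neighbors=min(n_neighbors, len(x) - 1))
        est.fit(x)
        return -est.negative_outlier_factor_      # sklearn stores the negated LOF
    lof_sur, lof_imp = lof(surface_vecs), lof(implicit_vecs)
    return alpha / lof_sur + (1.0 - alpha) / lof_imp
```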
According to a preferred embodiment of the invention, the difference between samples x_i and x_j is measured with the L2 distance between their surface feature vectors and between their implicit feature vectors, as shown in formulas (XIII) and (XIV), where δ is the difference threshold, an adjustable parameter that defaults to the mean of the distances between all samples in the corresponding feature space:
|| f_sur(x_i) − f_sur(x_j) ||_2 > δ_sur   (XIII)
|| f_imp(x_i) − f_imp(x_j) ||_2 > δ_imp   (XIV)
a candidate sample is selected when it satisfies formulas (XIII) and (XIV); this step determines whether the candidate sample differs from the samples already in the memory set.
According to a preferred embodiment of the invention, the balance means that the feature distribution of the sample set M selected into the memory set approximates the feature distribution of the original data set D, i.e. for every knowledge point k in the knowledge point set K:
P_M(x ∈ X_k ; θ_M) ≈ P_D(x ∈ X_k ; θ_D)
where P_M(· ; θ_M) denotes the probability distribution of the memory-set samples with parameter θ_M (M is the abbreviation of the memory set and should be read as a whole), P_D(· ; θ_D) denotes the sample probability distribution of the original data set, θ_D and θ_M are respectively the parameters of the probability distribution of the original data set samples and of the memory-set samples (a probability distribution P is written P(· ; θ) together with the specific parameter θ that determines it), ∈ is the basic set-membership operation, x denotes a sample, x ∈ X_k denotes a sample describing knowledge point k, and K is the set of knowledge points; after a sample is added, whether the formula still holds is verified, and if it holds the balance definition is met.
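A sketch of the balance test: every knowledge point of the original data set must be covered, and the per-knowledge-point proportions of the memory set should stay close to those of the original data set. The tolerance-based comparison is an assumed reading of the distribution-approximation constraint.

```python
from collections import Counter
from typing import List

def keeps_balance(memory_kps: List[str], dataset_kps: List[str], tolerance: float = 0.1) -> bool:
    """Balance index: every knowledge point of the original data set is covered and the
    per-knowledge-point proportion in the memory set stays close to the original one."""
    mem, full = Counter(memory_kps), Counter(dataset_kps)
    if set(full) - set(mem):                      # some knowledge point not covered yet
        return False
    n_mem, n_full = len(memory_kps), len(dataset_kps)
    return all(abs(mem[k] / n_mem - full[k] / n_full) <= tolerance for k in full)
```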
In the above process, the hyperparameters involved can keep their default values or be customised by the user to meet actual service requirements. The two parts of the invention let the user choose the sample-selection strategy suited to their own situation: the user provides the sample sets, the temporal relationship between the sample sets, the strategy and the optional hyperparameters, and the invention automatically selects historical samples according to the established algorithm and adds them to the memory set for the subsequent training of the natural language inference model oriented to professional knowledge points.
An apparatus for implementing the sample selection method, characterized by comprising: a central vector calculation module, a sample selection module and a training module;
the central vector calculation module calculates the central vectors of the surface features and the implicit features according to the label information of the samples, for use in subsequent sample selection;
the sample selection module, using the chosen sample-selection strategy, selects suitable samples to add to the memory set according to the properties the samples must satisfy, finally obtaining the complete memory set;
the training module uses the complete memory set to assist the training of the current task.
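A schematic arrangement of the three modules; the class and method names are illustrative and not taken from the patent.

```python
class CenterVectorModule:
    def compute(self, samples):
        """Group samples by knowledge point label and return the mean surface-feature
        and implicit-feature vectors of each group (formulas (I)/(II))."""
        ...

class SampleSelectionModule:
    def build_memory_set(self, historical_tasks, strategy, quota):
        """Traverse candidates with the chosen strategy and keep those that satisfy the
        representativeness, difference and balance indexes until the quota is reached."""
        ...

class TrainingModule:
    def train(self, model, current_task_samples, memory_set):
        """Fine-tune the inference model on the selected current-task samples together
        with the replayed memory set."""
        ...
```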
A computer device for implementing the sample selection method, characterized by comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following process:
in the use stage, the user provides the tasks, the precedence relationship between the tasks, the strategy and optional hyperparameters; historical samples are then selected according to the sample selection method and added to the memory set for subsequent professional text reasoning model training, i.e. the training of the natural language inference model oriented to professional knowledge points.
The technical advantages of the invention are as follows:
The continual-learning sample selection method oriented to the text knowledge inference model takes into account sample properties such as representativeness, balance and difference; compared with prior-art methods that select representative samples based on cluster centres, it adapts better to complex text reasoning scenarios, effectively uses a small number of samples to approximate the distribution of the original samples, and enables the model to retain the knowledge learned on historical tasks.
The method fine-tunes the text knowledge inference model according to the properties of the high-quality memory-set samples and of the samples screened for the practical problem at hand; it better helps the model remember previous tasks while also helping the model train on the current task, effectively increasing the robustness of the model in practical use.
In addition, the method can also be used for other similar tasks in the natural language processing field, such as continual-learning tasks based on memory replay, e.g. knowledge question answering and text classification.
Drawings
FIG. 1 is a diagram illustrating the relationship between tasks and data sets in text-oriented reasoning in the present invention;
FIG. 2 is a flow chart of a continuous learning history sample selection method for text knowledge reasoning according to the present invention;
FIG. 3 is a schematic diagram of a sample selection flow of the method for selecting a continuous learning history sample oriented to text knowledge inference according to the present invention;
FIG. 4 is a flow chart of the method for continuously learning current sample selection based on text knowledge inference according to the present invention.
Detailed Description
The invention is described in detail below with reference to the following examples and the accompanying drawings of the specification, but is not limited thereto.
Embodiment 1
As shown in FIG. 1 and FIG. 2, a sample selection method for continuous learning of a text knowledge inference model comprises: historical task sample selection and current task sample selection;
(1) Wherein the historical task sample selection comprises the following steps:
obtaining central vectors and selecting samples, as shown in FIG. 2, wherein the central vectors refer to the central vectors of all samples describing the same knowledge point, and sample selection adds the selected samples to the memory sample set M;
when obtaining the central vectors:
for samples of the form (p, h, y), formulas (I) and (II) are used to calculate the surface-feature central vector c_sur^j and the implicit-feature central vector c_imp^j, where D_t^j denotes all samples in the data set D_t that describe the j-th knowledge point k_j, and f_sur(·) and f_imp(·) are the functions that extract the surface features and the implicit features of a text respectively:
c_sur^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_sur(x)   (I)
c_imp^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_imp(x)   (II)
in formulas (I) and (II), c_sur^j and c_imp^j denote the surface-feature central vector and the implicit-feature central vector of knowledge point k_j;
obtaining the surface features (explicit features) and implicit features (latent features) of a text: the surface features are expressed by term frequency and inverse document frequency (TF-IDF) and denoted f_sur(text); the implicit features are expressed by the Sentence-BERT vector and denoted f_imp(text), where text is the text to be encoded;
when the sample is selected, the process is shown in fig. 3, and includes:
(1-1) determining the number of samples selected and added into a memory set:
the formulas (III) - (IV) are to determine how many samples describing the same knowledge point are selected and added into the memory set, and to remember the current task as the first task according to the sequence of the tasks
Figure SMS_190
Individual task, history
Figure SMS_191
The amount of samples selected in the data set of each task is
Figure SMS_192
As in formula (III), wherein
Figure SMS_193
The total amount of samples to be selected for model training, namely the sum of the memory set sample size and the current task selection sample size:
Figure SMS_194
determined by the formula (IV)
Figure SMS_195
First of a data set of a task
Figure SMS_196
The number of samples selected from the related samples of each knowledge point is
Figure SMS_197
So that the number distribution of the samples of each knowledge point extracted is consistent with that of the original data set:
Figure SMS_198
wherein
Figure SMS_199
Denotes the first
Figure SMS_200
A data set;
Figure SMS_201
is shown as
Figure SMS_202
Described in the individual data set
Figure SMS_203
A data set of individual knowledge points;
(1-2) selecting samples:
the memory set is selected by measuring samples with the representativeness, difference and balance indexes; during selection, the samples are traversed with one of the following two schemes;
scheme (1): traverse in ascending order of the distance between the sample vector and the central vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly with equal probability;
if a traversed sample satisfies representativeness, difference and balance, it is added to the memory set;
otherwise the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
(2) Wherein, the current task sample selection comprises:
sample representative analysis, sample difficulty analysis and sample sampling; the flow is shown in FIG. 4;
in order to reduce the training cost, only a very small number of samples are expected to be selected for training on the current task; however, a small number of samples can hardly represent the characteristics of the whole sample population, so representative and difficult samples need to be extracted from the current-task data set to train the text knowledge inference model. Since the purpose of screening samples is to fine-tune the text knowledge inference model, on the one hand the representativeness of the whole sample population must be preserved as far as possible under the constraint of a limited sample budget, and on the other hand difficult samples should be selected because they carry more information that benefits the model; the method therefore combines the two indexes of representativeness and difficulty to screen samples on the current task;
(2-1) sample representativeness analysis in current-task sample selection, comprising:
for the candidate sample set of the j-th knowledge point in the data set D_t of the current task (the t-th task), clustering is performed on the set of hypothesis-text surface-feature vectors of the samples, with the number of clusters K specified in advance; K can be determined from the number of knowledge points of past tasks, i.e. a value similar to that number is chosen; according to formula (VI), K clusters C_1, ..., C_K are obtained; the number of samples to be extracted for the j-th knowledge point of the current task is m_{t,j}; for each cluster C_u, the variance of the samples in the cluster and the number of samples in the cluster are calculated to analyse sample representativeness: in a cluster with a large variance and a large number of samples, each individual sample is less representative of the cluster, so more samples need to be drawn from that cluster to maintain its representativeness; the number of samples drawn from cluster C_u is determined according to formula (V) and denoted m_{t,j,u}, i.e. the number of samples selected from the u-th cluster of the j-th knowledge point in the t-th task;
(2-2) sample difficulty analysis in current-task sample selection, comprising:
the pre-trained professional text reasoning model f takes as input the premise text p and the hypothesis text h of a sample x and performs inference prediction, obtaining a predicted probability distribution over the class set {0, 1, 2}, where ŷ denotes the class label with the maximum probability; formula (VII) calculates, for sample x, the difference between the largest and the second-largest output probabilities in the predicted distribution, and uses this margin to measure the difficulty d(x) of the sample:
d(x) = P_f(ŷ | x) − max_{c ≠ ŷ} P_f(c | x)   (VII)
the smaller d(x) is, the less confident the reasoning model is in its prediction and the more difficult the sample is, where P_f(ŷ | x) denotes the probability with which the professional text reasoning model f predicts the sample class to be ŷ, and P_f(c | x) denotes the probability with which it predicts the sample class to be c;
(2-3) sample sampling in current-task sample selection, comprising:
(2-3-1) for the candidate sample set of each knowledge point in the current-task data set D_t, the sample representativeness analysis of step (2-1) is performed to obtain the number of samples m_{t,j,u} to draw from each cluster C_u, so as to maintain the representativeness of the screened sample set;
(2-3-2) the samples are then subjected to the sample difficulty analysis of step (2-2), and the difficulty value d(x) of each sample x is calculated; for each cluster C_u, the m_{t,j,u} most difficult samples are drawn from it, i.e. the samples with the smallest d(x) values are selected and added to the small sample set to be screened; every cluster goes through the above sampling process, which completes the sampling for knowledge point k_j; all knowledge points in the current task go through this process to finish the data screening on the current task, finally yielding the training sample set of the current task.
Embodiment 2
In the sample selection method for continuous learning of the text knowledge inference model described in Embodiment 1, in step (1), the specific method for acquiring the surface features of a text is:
segmenting the text with the coarse-grained tokenizer of HanLP, removing common Chinese stop words from the segmentation result, and filtering out words that appear only once to obtain the dictionary of all texts; finally, the TF-IDF feature vector of each text is calculated as the surface feature of the sample.
In step (1), the specific method for acquiring the implicit features of a text is:
using SentenceTransformers, the concrete implementation of the Sentence-BERT model, loading the pre-trained model paraphrase-multilingual-mpnet-base-v2 to encode the text, and using the encoded Sentence-BERT feature vector to express the implicit features of the text.
Embodiment 3
In the sample selection method for continuous learning of the text knowledge inference model described in Embodiments 1 and 2, in step (1-2), the method for selecting the memory set by measuring samples with the representativeness, difference and balance indexes comprises:
for a series of tasks T_1, ..., T_t, each task has a corresponding data set, where T_t denotes the task currently to be learned and D_t its corresponding data set, and each sample in a data set is recorded as x = (p, h, y); when the text knowledge reasoning model learns on the current task, memory replay is performed with the memory set M = ∪_{i < t} M_i, where ∪ is the set-union operation whose range covers all tasks before the current one, M_i denotes the set of samples from data set D_i that are added to the memory set, and M is the abbreviation of the memory set; the set M must satisfy representativeness, difference and balance;
the representativeness, measured with the Local Outlier Factor (LOF): because individuals differ in professional knowledge level and expression, the samples describing the same knowledge point vary in quality and take diverse forms, so the samples added to the memory set should represent the different qualities and forms of all samples describing the corresponding knowledge point;
the difference: for samples x_i and x_j, the difference is measured with the distance between their surface and implicit feature vectors f_sur(·), f_imp(·); to achieve the robustness goal of the model, diverse samples, i.e. samples whose features differ from one another, are selected into the memory set;
the balance means that the memory set covers all knowledge points described in the original data set, and that the number of samples describing the same knowledge point in the memory set is balanced against the number of such samples in the original data set.
The specific method for measuring representativeness with the Local Outlier Factor (LOF) is:
formula (VIII) is used to calculate the distance d(a, b) between two vectors, where a and b are two surface feature vectors or two implicit feature vectors, referred to simply as vectors a and b;
formula (IX) is used to obtain the k-th reachable distance from vector a to vector b:
reach-dist_k(a, b) = max{ k-distance(b), d(a, b) }   (IX)
where k-distance(b) denotes, in the vector space, the distance from vector b to its k-th nearest vector;
formula (X) is used to calculate the local reachable density of a:
lrd_k(a) = 1 / ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} reach-dist_k(a, b) )   (X)
where N_k(a) is the set of all vectors within the k-th distance of vector a, and the k-th distance of vector a is the distance from a to its k-th nearest vector;
formula (XI) is used to calculate the local outlier factor of a:
LOF_k(a) = ( (1 / |N_k(a)|) * Σ_{b ∈ N_k(a)} lrd_k(b) ) / lrd_k(a)   (XI)
when LOF_k(a) is greater than 1, the local reachable density of the current vector is smaller than that of the surrounding vectors in the vector space, and the larger the value, the more the current sample is an outlier; when LOF_k(a) is less than or equal to 1, the local reachable density of the current vector is greater than that of the surrounding vectors, and the smaller the value, the more the current sample is aggregated;
the representativeness Rep(x) of a sample is obtained by combining the local outlier factors of the sample in the surface feature space and in the implicit feature space; the smaller the LOF values, the larger Rep(x) and the more representative the sample; LOF values lie in (0, +∞), and from the above discussion of values less than or equal to 1 and greater than 1, a larger LOF value means a greater degree of outlier and a more likely abnormal vector, while a smaller LOF value means a smaller degree of outlier and a more likely normal vector;
formula (XII) gives the calculation of Rep(x) for a sample x: it combines the two LOF values with an adjustable parameter α that indicates the relative importance of the surface features; if the surface features are more important, α is increased, and if the implicit features are more important, α is decreased, the default value being 0.5; the division in formula (XII) keeps Rep(x) within a similar range for different sample distributions.
The difference between samples x_i and x_j is measured with the L2 distance between their surface feature vectors and between their implicit feature vectors, as shown in formulas (XIII) and (XIV), where δ is the difference threshold, an adjustable parameter that defaults to the mean of the distances between all samples in the corresponding feature space:
|| f_sur(x_i) − f_sur(x_j) ||_2 > δ_sur   (XIII)
|| f_imp(x_i) − f_imp(x_j) ||_2 > δ_imp   (XIV)
a candidate sample is selected when it satisfies formulas (XIII) and (XIV); this step determines whether the candidate sample differs from the samples already in the memory set.
The balance means that the feature distribution of the sample set M selected into the memory set approximates the feature distribution of the original data set D, i.e. for every knowledge point k in the knowledge point set K:
P_M(x ∈ X_k ; θ_M) ≈ P_D(x ∈ X_k ; θ_D)
where P_M(· ; θ_M) denotes the probability distribution of the memory-set samples with parameter θ_M (M is the abbreviation of the memory set and should be read as a whole), P_D(· ; θ_D) denotes the sample probability distribution of the original data set, θ_D and θ_M are respectively the parameters of the probability distribution of the original data set samples and of the memory-set samples (a probability distribution P is written P(· ; θ) together with the specific parameter θ that determines it), ∈ is the basic set-membership operation, x denotes a sample, x ∈ X_k denotes a sample describing knowledge point k, and K is the set of knowledge points; after a sample is added, whether the formula still holds is verified, and if it holds the balance definition is met.
In the above process, the hyperparameters involved can keep their default values or be customised by the user to meet actual service requirements. The two parts of the invention let the user choose the sample-selection strategy suited to their own situation: the user provides the sample sets, the temporal relationship between the sample sets, the strategy and the optional hyperparameters, and the invention automatically selects historical samples according to the established algorithm and adds them to the memory set for the subsequent training of the natural language inference model oriented to professional knowledge points.
Embodiment 5
An apparatus for implementing the sample selection method according to Embodiments 1-4, comprising: a central vector calculation module, a sample selection module and a training module;
the central vector calculation module calculates the central vectors of the surface features and the implicit features according to the label information of the samples, for use in subsequent sample selection;
the sample selection module, using the chosen sample-selection strategy, selects suitable samples to add to the memory set according to the properties the samples must satisfy, finally obtaining the complete memory set;
the training module uses the complete memory set to assist the training of the current task.
Embodiment 6
A computer device for implementing the sample selection method of Embodiments 1-4, comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following process:
in the use stage, the user provides the tasks, the precedence relationship between the tasks, the strategy and optional hyperparameters; historical samples are then selected according to the sample selection method and added to the memory set for subsequent professional text reasoning model training, i.e. the training of the natural language inference model oriented to professional knowledge points.

Claims (10)

1. A sample selection method for continuous learning of a text knowledge inference model is characterized by comprising the following steps: selecting a historical task sample and a current task sample;
(1) Wherein the selection of the historical task sample comprises the following steps: obtaining a central vector and selecting a sample;
when obtaining the central vectors: for samples of the form (p, h, y), formulas (I) and (II) are used to calculate the surface-feature central vector c_sur^j and the implicit-feature central vector c_imp^j, where D_t^j denotes all samples in the data set D_t that describe the j-th knowledge point k_j, and f_sur(·) and f_imp(·) are the functions that extract the surface features and the implicit features of a text respectively:
c_sur^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_sur(x)   (I)
c_imp^j = (1 / |D_t^j|) * Σ_{x ∈ D_t^j} f_imp(x)   (II)
in formulas (I) and (II), c_sur^j and c_imp^j denote the surface-feature central vector and the implicit-feature central vector of knowledge point k_j;
obtaining the surface features and implicit features of the text: the surface features are expressed by term frequency and inverse document frequency and denoted f_sur(text); the implicit features are expressed by the Sentence-BERT vector and denoted f_imp(text), where text is the text to be encoded;
when selecting the samples, the method comprises:
(1-1) determining the number of samples to be added to the memory set;
(1-2) selecting samples: the memory set is selected by measuring samples with the representativeness, difference and balance indexes, and during selection the samples are traversed with one of the following two schemes,
scheme (1): traverse in ascending order of the distance between the sample vector and the central vector, where the vectors include the surface feature vector and the implicit feature vector;
scheme (2): traverse the samples randomly with equal probability;
if a traversed sample satisfies representativeness, difference and balance, it is added to the memory set;
otherwise the traversed sample is discarded and the traversal continues, until the number of selected samples meets the requirement;
(2) the current task sample selection comprises: sample representativeness analysis, sample difficulty analysis and sample sampling;
(2-1) performing sample representativeness analysis in the current task sample selection;
(2-2) performing sample difficulty analysis in the current task sample selection;
(2-3) performing sample sampling in the current task sample selection.
2. The sample selection method for continual learning of a text knowledge inference model according to claim 1, characterized in that in step (1):
(1-1) determining the number of samples to be selected and added to the memory set:
according to the order of the tasks, the current task is recorded as the t-th task; the number of samples selected from the data set of each of the t-1 historical tasks is given by formula (III), in which the total number of samples used for model training is the sum of the memory-set sample size and the number of samples selected from the current task [formula image not reproduced];
the number of samples selected from the samples of each knowledge point in the data set of the i-th task is determined by formula (IV), so that the distribution of the number of extracted samples over the knowledge points is consistent with that of the original data set [formula image not reproduced]; in the formula, one symbol denotes the i-th data set and another denotes the samples of that data set describing a given knowledge point;
in step (2), (2-1) the sample representativeness analysis in current task sample selection comprises:
for the samples in the candidate sample set of a knowledge point of the current task data set, the surface feature vectors of the hypothesis texts are clustered, with the number of clusters set according to formula (VI), yielding a set of clusters [formula image not reproduced]; given the number of samples to be extracted for this knowledge point in the current task, for each cluster the variance of the samples in the cluster and the number of samples in the cluster are computed to analyze sample representativeness: the number of samples to be drawn from each cluster, i.e. the number of samples selected from that cluster for this knowledge point in the t-th task, is determined according to formula (V) [formula image not reproduced];
(2-2) the sample difficulty analysis in current task sample selection comprises:
the premise text and the hypothesis text of a sample are input into the pre-trained professional text inference model, which makes an inference prediction and outputs a probability distribution over the class set; the class label with the maximum predicted probability is taken; formula (VII) computes, for the sample, the difference between the largest and the second-largest output probabilities in the predicted distribution, and this margin is used to measure the difficulty of the sample [formula image not reproduced]; in the formula, one term is the probability with which the model predicts the sample's class to be the most probable label, and the other is the probability with which the model predicts the sample's class to be a given class c;
(2-3) the sample sampling in current task sample selection comprises:
(2-3-1) performing the sample representativeness analysis of step (2-1) on the candidate sample set of each knowledge point of the current task data set, obtaining the number of samples to be drawn from each cluster, so as to keep the screened sample set representative;
(2-3-2) then performing the sample difficulty analysis of step (2-2) and computing the difficulty value of each sample; from each cluster, the required number of most difficult samples is drawn, i.e. samples are selected in ascending order of the margin value (a smaller margin means a harder sample) and added to the small sample set to be screened; after every cluster has gone through this sampling, the sampling for that knowledge point is complete; all knowledge points of the current task go through the same process to complete the data screening on the current task, finally yielding the current-task training sample set of the required size (a sketch of this per-knowledge-point procedure is given after this claim).
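A sketch of steps (2-1)-(2-3) for a single knowledge point, assuming k-means over the hypothesis-text TF-IDF vectors; because formulas (V) and (VI) are not reproduced here, the per-cluster allocation below is simply proportional to cluster size and the cluster count is passed in, which are assumptions; all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def margin_difficulty(probs):
    """Step (2-2): difference between the largest and second-largest
    predicted probabilities; a smaller margin means a harder sample."""
    top2 = np.sort(np.asarray(probs))[-2:]
    return top2[1] - top2[0]

def select_for_knowledge_point(hyp_tfidf, pred_probs, n_per_point, n_clusters):
    """Steps (2-1)-(2-3) for one knowledge point of the current task."""
    n_total = hyp_tfidf.shape[0]
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(hyp_tfidf)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        # assumed allocation: proportional to cluster size (the patent's
        # formula (V) also involves the in-cluster variance, not shown here)
        n_c = max(1, round(n_per_point * len(idx) / n_total))
        margins = np.array([margin_difficulty(pred_probs[i]) for i in idx])
        # hardest samples first: ascending margin
        chosen.extend(idx[np.argsort(margins)[:n_c]].tolist())
    return chosen
```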
3. The sample selection method for continual learning of a text knowledge inference model according to claim 1, wherein in step (1) the surface features of a text are acquired as follows:
the text is segmented with the coarse-grained tokenizer of HanLP; common Chinese stop words are removed from the segmentation result and words that occur only once are filtered out, which yields the dictionary of all texts; finally, the TF-IDF feature vector of each text is computed as the surface feature of the sample.
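A minimal sketch of this step, assuming scikit-learn's TfidfVectorizer; the tokenizer is injected as a callable (e.g. a wrapper around HanLP's coarse-grained segmenter) rather than pinning a particular HanLP API version, and the stop-word list is supplied by the caller:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def surface_features(texts, tokenize, stop_words):
    """TF-IDF surface features as described in claim 3.

    tokenize   : callable returning a token list for a text
    stop_words : set of common Chinese stop words to remove
    """
    token_lists = [[w for w in tokenize(t) if w not in stop_words] for t in texts]
    # drop words that occur only once across the whole corpus
    counts = Counter(w for toks in token_lists for w in toks)
    vocab = sorted(w for w, c in counts.items() if c > 1)
    docs = [" ".join(w for w in toks if w in vocab) for toks in token_lists]
    vectorizer = TfidfVectorizer(vocabulary=vocab, analyzer=str.split)
    return vectorizer.fit_transform(docs), vectorizer
```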
4. The sample selection method for continual learning of a text knowledge inference model according to claim 1, wherein in step (1) the implicit features of a text are acquired as follows:
the Sentence-BERT model is used through its sentence-transformers implementation; the pre-trained model paraphrase-multilingual-mpnet-base-v2 is loaded to encode the text, and the resulting Sentence-BERT feature vector expresses the implicit features of the text.
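A minimal sketch, assuming the sentence-transformers package; the encoding options shown are library defaults, not requirements stated in the patent:

```python
from sentence_transformers import SentenceTransformer

# Load the checkpoint named in claim 4; it encodes text into a fixed-size
# Sentence-BERT vector used as the implicit feature of the sample.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def implicit_features(texts):
    """Return one Sentence-BERT vector per input text."""
    return model.encode(texts, convert_to_numpy=True)
```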
5. The sample selection method for continual learning of a text knowledge inference model according to claim 1, wherein in step (1-2) the samples are measured by the indexes of representativeness, difference and balance to select the memory set as follows:
representativeness is measured with the local outlier factor;
difference is measured, for a sample, by the L2 distances of its surface feature vector and of its implicit feature vector;
balance means that all knowledge points described in the original data set are covered in the memory set, and that the number of samples describing the same knowledge point is balanced between the original data set and the memory set.
6. The sample selection method for continual learning of a text knowledge inference model according to claim 5, wherein representativeness is measured with the local outlier factor as follows:
formula (VIII) computes the distance between two vectors, both of which are surface feature vectors or both implicit feature vectors, referred to here as vector p and vector o [formula image not reproduced];
formula (IX) gives the k-reachable distance from vector p to vector o, where the k-distance of o is the distance, in the vector space, from vector o to its k-th nearest vector [formula image not reproduced];
formula (X) computes the local reachable density of vector p, where the k-distance neighborhood of p is the set of all vectors within the k-distance of p, and the k-distance of p is the distance from vector p to its k-th nearest vector [formula image not reproduced];
formula (XI) computes the local outlier factor (LOF) of vector p [formula image not reproduced];
when the LOF value of a vector is greater than 1, the local reachable density of that vector is smaller than the local reachable density of the surrounding vectors in the vector space, and the larger the value, the more the current sample is an outlier;
when the LOF value of a vector is less than or equal to 1, the local reachable density of that vector is greater than the local reachable density of the surrounding vectors, and the smaller the value, the more the current sample is aggregated;
the representativeness of a sample is obtained by combining its local outlier factors in the surface feature space and the implicit feature space: the smaller the LOF values, the larger the representativeness;
formula (XII) gives the calculation of the representativeness of a sample, in which an adjustable parameter indicates the relative importance of the surface features [formula image not reproduced].
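A sketch of this representativeness score using scikit-learn's LocalOutlierFactor; since formula (XII) is not reproduced here, the weighted combination and the reciprocal mapping from LOF to representativeness are assumptions that only preserve the stated monotonicity (smaller LOF, larger representativeness):

```python
from sklearn.neighbors import LocalOutlierFactor

def representativeness(surf_vecs, impl_vecs, lam=0.5, k=20):
    """Combine each sample's LOF in the surface (TF-IDF) space and the
    implicit (Sentence-BERT) space into a single representativeness score.

    lam : assumed weight in [0, 1] for the surface-feature LOF
    k   : neighborhood size used by the local outlier factor
    """
    # scikit-learn stores -LOF in negative_outlier_factor_; flip the sign
    lof_surf = -LocalOutlierFactor(n_neighbors=k).fit(surf_vecs).negative_outlier_factor_
    lof_impl = -LocalOutlierFactor(n_neighbors=k).fit(impl_vecs).negative_outlier_factor_
    combined = lam * lof_surf + (1.0 - lam) * lof_impl
    return 1.0 / combined  # smaller LOF -> denser region -> more representative
```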
7. The sample selection method for continual learning of a text knowledge inference model according to claim 5, wherein the difference between a candidate sample and the samples already selected is measured with the L2 distances of their surface feature vectors and of their implicit feature vectors, as in formulas (XIII) and (XIV), in which a difference threshold is given [formula images not reproduced]; a candidate sample enters the memory set when it satisfies formulas (XIII) and (XIV).
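A sketch of this check; because formulas (XIII) and (XIV) are not reproduced here, requiring the minimum L2 distance to every memory-set sample to exceed a threshold in each feature space is an assumption, and the threshold parameters are illustrative:

```python
import numpy as np

def is_different(surf_vec, impl_vec, memory_surf, memory_impl, thr_surf, thr_impl):
    """Difference check: the candidate must be far enough (L2) from every
    sample already in the memory set, in both feature spaces."""
    if len(memory_surf) == 0:
        return True
    d_surf = np.linalg.norm(np.asarray(memory_surf) - surf_vec, axis=1).min()
    d_impl = np.linalg.norm(np.asarray(memory_impl) - impl_vec, axis=1).min()
    return d_surf > thr_surf and d_impl > thr_impl
```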
8. The sample selection method for continual learning of a text knowledge inference model according to claim 5, wherein balance means that the feature distribution of the sample set selected into the memory set approximates the feature distribution of the original data set [formula image not reproduced], where:
one term is the probability distribution of the memory-set samples, parameterized by the memory-set distribution parameter; the parameterized expression, together with its abbreviated form, is to be read as a whole;
another term is the sample probability distribution of the original data set, parameterized in the same way by the original-data-set distribution parameter;
the two parameters are, respectively, those of the original-data-set sample probability distribution and of the memory-set sample probability distribution; for a probability distribution, the specific parameter that determines it is attached to it in this notation;
the membership symbol is the basic set-membership operation; one symbol denotes a sample, another denotes a sample describing a given knowledge point, and the last denotes the set of knowledge points.
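A sketch of a balance check consistent with claims 5 and 8; the patent's own distribution-approximation formula is not reproduced, so comparing the per-knowledge-point sample shares of the memory set and the original data set against a tolerance is an assumption, and the parameter names are illustrative:

```python
import numpy as np
from collections import Counter

def is_balanced(candidate_point, memory_points, original_points, tol=0.1):
    """Accept the candidate only if, after adding it, the per-knowledge-point
    share of the memory set stays close to the share in the original data set."""
    points = sorted(set(original_points))
    orig_counts = Counter(original_points)
    mem_counts = Counter(memory_points + [candidate_point])
    orig = np.array([orig_counts[p] for p in points], dtype=float)
    mem = np.array([mem_counts[p] for p in points], dtype=float)
    return np.abs(orig / orig.sum() - mem / mem.sum()).max() <= tol
```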
9. An apparatus for implementing the sample selection method according to any one of claims 1 to 8, comprising: a center vector calculation module, a sample selection module and a training module;
the center vector calculation module is used to calculate the center vectors of the surface features and the implicit features according to the label information of the samples;
the sample selection module selects suitable samples and adds them to the memory set, finally obtaining the complete memory set;
the training module is used to assist the training of the current task with the complete memory set.
10. A computer device implementing the sample selection method according to any one of claims 1 to 8, characterized by comprising a processor, a storage device, and a computer program stored on the storage device and executable on the processor; when executing the computer program, the processor implements the following process:
in the use stage, the user specifies the tasks, the precedence relationship among the tasks, the strategy and optional hyperparameters; historical samples are selected according to the sample selection method and added to the memory set for subsequent training of the professional text inference model, that is, training of the natural language inference model oriented to professional knowledge points.
CN202310107542.8A 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning Active CN115829036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310107542.8A CN115829036B (en) 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning


Publications (2)

Publication Number Publication Date
CN115829036A true CN115829036A (en) 2023-03-21
CN115829036B CN115829036B (en) 2023-05-05

Family

ID=85521149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310107542.8A Active CN115829036B (en) 2023-02-14 2023-02-14 Sample selection method and device for text knowledge reasoning model continuous learning

Country Status (1)

Country Link
CN (1) CN115829036B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
US20150278200A1 (en) * 2014-04-01 2015-10-01 Microsoft Corporation Convolutional Latent Semantic Models and their Applications
CN112347268A (en) * 2020-11-06 2021-02-09 华中科技大学 Text-enhanced knowledge graph joint representation learning method and device
WO2021022572A1 (en) * 2019-08-07 2021-02-11 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN112966115A (en) * 2021-05-18 2021-06-15 东南大学 Active learning event extraction method based on memory loss prediction and delay training
WO2021184311A1 (en) * 2020-03-19 2021-09-23 中山大学 Method and apparatus for automatically generating inference questions and answers
CN115563315A (en) * 2022-12-01 2023-01-03 东南大学 Active complex relation extraction method for continuous few-sample learning
CN115618045A (en) * 2022-12-16 2023-01-17 华南理工大学 Visual question answering method, device and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王凌霄; 姜寅; 何联毅; 周凯: "Continuous-Mixture Autoregressive Networks Learning the Kosterlitz–Thouless Transition" *
胡正平; 高文涛; 万春艳: "Research on a controllable active learning algorithm combining sample uncertainty and representativeness" (基于样本不确定性和代表性相结合的可控主动学习算法研究) *

Also Published As

Publication number Publication date
CN115829036B (en) 2023-05-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant