WO2021139317A1 - Data feature enhancement method and apparatus for corpus data, computer device, and storage medium - Google Patents

Data feature enhancement method and apparatus for corpus data, computer device, and storage medium Download PDF

Info

Publication number
WO2021139317A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus data
data
corpus
recognition model
sample
Application number
PCT/CN2020/122842
Other languages
French (fr)
Chinese (zh)
Inventor
林佳佳
郝正鸿
王少军
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021139317A1 publication Critical patent/WO2021139317A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of artificial intelligence model hosting, and in particular to a method, device, computer equipment and storage medium for enhancing data features of corpus data.
  • the quality of the training corpus is the key factor that determines the effect of the model.
  • the quality of the corpus is generally measured in two aspects: “quality” and “quantity”. “Quality” means ensuring the correctness of the corpus and clear boundaries between different intents, while “quantity” means ensuring that the model can fully learn the distribution of data features; the two complement each other and are both indispensable.
  • when sorting out the training data, R&D staff found that adding a sample to the training set while expanding its “quantity” does not necessarily have a positive impact.
  • at the same time, the inventor found that expanding the training corpus also consumes a lot of manpower, that is, the required labor cost is high. This is because the current work of corpus data cleaning is almost entirely done manually, which leads to low efficiency in obtaining high-quality training sets.
  • the embodiments of the present application provide a method, device, computer equipment and storage medium for enhancing data features of corpus data, aiming to solve the problems in the prior art that expanding the training corpus is done manually, which incurs a high labor cost, and that the data cleaning involved in expanding the corpus data is also done manually, which leads to low efficiency in obtaining high-quality training sets.
  • an embodiment of the present application provides a data feature enhancement method for corpus data, which includes:
  • sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • obtaining the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall rate difference and prediction accuracy difference corresponding to each corpus data;
  • if the average accuracy difference, sample recall rate difference and prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
  • deleting the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • an embodiment of the present application provides a data feature enhancement device for corpus data, which includes:
  • a corpus data set acquisition unit configured to acquire a full corpus data set; wherein, the full corpus data set includes multiple corpus data;
  • a data set dividing unit configured to call a preset total number of groups to divide the full corpus data set into corresponding groups of corpus data subsets according to the total number of groups;
  • a group training unit, configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • an average accuracy difference calculation unit, configured to obtain the first model average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second model average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the average accuracy difference corresponding to each corpus data;
  • a sample recall rate difference calculation unit, configured to obtain the first sample recall rate corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second sample recall rate when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the sample recall rate difference corresponding to each corpus data;
  • a prediction accuracy difference calculation unit, configured to obtain the first prediction average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second prediction average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the prediction accuracy difference corresponding to each corpus data;
  • a sample contribution triple acquisition unit, configured to obtain the sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
  • a triple judging unit, configured to judge whether there is corpus data whose corresponding sample contribution triple has an average accuracy difference, a sample recall rate difference and a prediction accuracy difference that are all negative;
  • a negative sample deletion unit, configured to, if the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtain the corresponding target corpus data to form a corpus data set to be deleted; and
  • a first data set update unit, configured to delete the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
  • sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • obtaining the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall rate difference and prediction accuracy difference corresponding to each corpus data;
  • if the average accuracy difference, sample recall rate difference and prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
  • deleting the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • the embodiments of the present application also provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • obtaining the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall rate difference and prediction accuracy difference corresponding to each corpus data;
  • if the average accuracy difference, sample recall rate difference and prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
  • deleting the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • the embodiments of the application provide a method, device, computer equipment and storage medium for enhancing data features of corpus data. After the full corpus data set is obtained, the data is first grouped to obtain multiple corpus data subsets; each time one corpus data subset is deleted in sequence, the user intent recognition model to be trained is trained on the remaining data, so that multiple user intent recognition models are obtained.
  • for each corpus data in the full corpus data set, the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference between using it as training sample data and using it as test sample data are calculated to obtain the sample contribution triple corresponding to each corpus data; if the three differences in the sample contribution triple corresponding to a corpus data are all negative, the corresponding target corpus data is obtained to form a corpus data set to be deleted, which is then deleted from the full corpus data set.
  • in this way, automatic cleaning of negative-contribution corpus data is realized, and the cleaning process requires no human intervention, which improves the efficiency of obtaining a high-quality training set.
  • FIG. 1 is a schematic diagram of an application scenario of a method for enhancing data features of corpus data provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a method for enhancing data features of corpus data provided by an embodiment of this application;
  • FIG. 3 is a schematic block diagram of an apparatus for enhancing data features of corpus data provided by an embodiment of this application;
  • Fig. 4 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the method for enhancing data features of corpus data provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of the method for enhancing data features of corpus data provided by an embodiment of this application.
  • the data feature enhancement method of corpus data is applied to the server, and the method is executed by the application software installed in the server.
  • the method includes steps S101 to S110.
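  • Before the steps are described in detail, the following minimal Python sketch outlines the overall flow of steps S101 to S110. It is an illustration only: the function names (`train_intent_model`, `contribution_triple`) and the data layout are hypothetical placeholders, not part of the disclosed implementation.

```python
import math
from typing import Callable, List, Tuple

def enhance_corpus_features(
    full_corpus: List[dict],
    k: int,
    train_intent_model: Callable[[List[dict]], object],
    contribution_triple: Callable[[dict, List[object], List[List[dict]]], Tuple[float, float, float]],
) -> List[dict]:
    """Outline of S101-S110: group the corpus, train k held-out models,
    score every corpus item with a contribution triple, drop all-negative items."""
    # S102: divide the full corpus data set into k corpus data subsets
    size = math.ceil(len(full_corpus) / k)
    subsets = [full_corpus[j * size:(j + 1) * size] for j in range(k)]

    # S103: each small round deletes one subset (it becomes the corpus test set)
    # and trains a user intent recognition model on the remaining subsets
    models, test_sets = [], []
    for j in range(k):
        train_set = [x for i, s in enumerate(subsets) if i != j for x in s]
        models.append(train_intent_model(train_set))
        test_sets.append(subsets[j])

    # S104-S107: sample contribution triple = (average accuracy difference,
    # sample recall rate difference, prediction accuracy difference)
    triples = [contribution_triple(item, models, test_sets) for item in full_corpus]

    # S108-S110: delete every item whose three differences are all negative
    return [item for item, t in zip(full_corpus, triples)
            if not all(v < 0 for v in t)]
```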
  • S101 Receive a full corpus data set sent by a user terminal; wherein the full corpus data set includes multiple corpus data.
  • in specific implementation, the client sends a full corpus data set to the server so that the server can filter out high-quality sample data with a high sample contribution and feed it back to the client; the client can then use this data set of high-quality sample data to train the model to be trained (for example, a convolutional neural network, a BERT model, etc.).
  • the full corpus data set is recorded as data set X.
  • the following takes a data set X that includes only 20 pieces of corpus data as an example for illustration, although in specific implementations the number of corpus data included in data set X is far greater than 20. The above 20 pieces of corpus data can each be recorded as the i-th piece of corpus data, where the value range of i is a positive integer in [1, 20].
  • the total number of groups stored in the server needs to be obtained at this time.
  • the total number of groups is denoted as k.
  • for example, when the total number of groups is 4, the 20 corpus data in the full corpus data set are divided into 4 corpus data subsets, each containing 5 corpus data.
  • each subset can be recorded as the j-th corpus data subset, where the value range of j is a positive integer in [1, 4].
  • the following continues the explanation with the 1st to 5th corpus data divided into the 1st corpus data subset, the 6th to 10th corpus data divided into the 2nd corpus data subset, the 11th to 15th corpus data divided into the 3rd corpus data subset, and the 16th to 20th corpus data divided into the 4th corpus data subset, as illustrated in the sketch below.
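  • As a concrete illustration of this grouping (20 corpus data items, total number of groups 4), a minimal sketch follows; the item names are placeholders.

```python
import math

corpus = [f"corpus_{i}" for i in range(1, 21)]   # the 20 corpus data items of data set X
k = 4                                            # preset total number of groups

size = math.ceil(len(corpus) / k)
subsets = [corpus[j * size:(j + 1) * size] for j in range(k)]
# subsets[0] -> corpus_1 .. corpus_5    (1st corpus data subset)
# subsets[1] -> corpus_6 .. corpus_10   (2nd corpus data subset)
# subsets[2] -> corpus_11 .. corpus_15  (3rd corpus data subset)
# subsets[3] -> corpus_16 .. corpus_20  (4th corpus data subset)
```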
  • since each corpus data can be used multiple times for training or testing the user intent recognition model within a single round of verification, the process can be stated as follows: sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups.
  • step S103 includes:
  • deleting the 1st corpus data subset from the full corpus data set, and training the user intent recognition model to be trained with the remaining corpus data subsets in the full corpus data set as a training set, to obtain the first-large-round first-small-round user intent recognition model;
  • sequentially deleting the 2nd corpus data subset to the k-th corpus data subset from the full corpus data set, and training the user intent recognition model to be trained with the corresponding remaining corpus data subsets as training sets, to obtain the first-large-round second-small-round user intent recognition model to the first-large-round k-th-small-round user intent recognition model in sequence.
  • for example, when the 1st corpus data subset is deleted in the first small round, the remaining 2nd, 3rd and 4th corpus data subsets form the first-large-round first-small-round training set, and the deleted 1st corpus data subset is used as the first-large-round first-small-round test set.
  • after the user intent recognition model to be trained is trained on the first-large-round first-small-round training set, the first-large-round first-small-round user intent recognition model is obtained.
  • when the 2nd corpus data subset is deleted in the second small round, the remaining 1st, 3rd and 4th corpus data subsets form the first-large-round second-small-round training set, and the deleted 2nd corpus data subset is used as the first-large-round second-small-round test set; training on this training set yields the first-large-round second-small-round user intent recognition model.
  • when the 3rd corpus data subset is deleted in the third small round, the remaining 1st, 2nd and 4th corpus data subsets form the first-large-round third-small-round training set, and the deleted 3rd corpus data subset is used as the first-large-round third-small-round test set; training on this training set yields the first-large-round third-small-round user intent recognition model.
  • when the 4th corpus data subset is deleted in the fourth small round, the remaining 1st, 2nd and 3rd corpus data subsets form the first-large-round fourth-small-round training set, and the deleted 4th corpus data subset is used as the first-large-round fourth-small-round test set; training on this training set yields the first-large-round fourth-small-round user intent recognition model.
  • after the corpus data subsets are deleted from the full corpus data set in the above order and the user intent recognition model to be trained is trained on each resulting training set, the number of user intent recognition models obtained equals the total number of groups; a minimal sketch of this round-by-round training is given below.
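  • A minimal sketch of this round-by-round training, assuming a hypothetical `train_intent_model` callable that stands in for whatever user intent recognition model is actually used:

```python
def leave_one_subset_out_rounds(subsets, train_intent_model):
    """For each small round j, delete the j-th corpus data subset (it becomes the
    test set) and train one user intent recognition model on the remaining subsets."""
    rounds = []
    for j, test_set in enumerate(subsets):
        train_set = [item for i, s in enumerate(subsets) if i != j for item in s]
        model = train_intent_model(train_set)  # first-large-round, (j+1)-th-small-round model
        rounds.append({"model": model, "train": train_set, "test": test_set})
    return rounds
```

  • With the 4 subsets of the example above, this sketch yields 4 models, one per small round, matching the total number of groups.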
  • the sample contribution triple is composed of the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference.
  • step S104 includes:
  • if the i-th corpus data is used as training sample data, the first target user intent recognition model set corresponding to the i-th corpus data as training sample data is obtained, the model accuracies corresponding to the first target user intent recognition models in the set are averaged, and the first model average accuracy corresponding to the i-th corpus data as training sample data is obtained;
  • if the i-th corpus data is used as test sample data, the second target user intent recognition model set corresponding to the i-th corpus data as test sample data is obtained, the model accuracies corresponding to the second target user intent recognition models in the set are averaged, and the second model average accuracy corresponding to the i-th corpus data as test sample data is obtained;
  • the difference between the first model average accuracy when the i-th corpus data is used as training sample data and the second model average accuracy when the i-th corpus data is used as test sample data is calculated to obtain the average accuracy difference corresponding to the i-th corpus data.
  • for example, the 1st corpus data is used as a test data sample in the first-large-round first-small-round training, and is used as a training data sample in the first-large-round second-small-round, third-small-round and fourth-small-round training.
  • when the 1st corpus data is used as a training data sample, the user intent recognition models obtained are the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intent recognition models; when the 1st corpus data is used as a test data sample, the user intent recognition model obtained is the first-large-round first-small-round user intent recognition model.
  • the first-large-round second-small-round test set corresponding to the first-large-round second-small-round user intent recognition model is used for model verification testing to obtain the first model accuracy of that model, where the first model accuracy equals the number of test data items predicted correctly in the first-large-round second-small-round test set divided by the total number of data items in that test set. For example, if the output value obtained by inputting the 6th corpus data into the first-large-round second-small-round user intent recognition model equals the corresponding label value of the 6th corpus data, the model has correctly predicted the result of the 6th corpus data; if the 7th, 8th and 10th corpus data are likewise predicted correctly, while the corresponding label value of the 9th corpus data cannot be predicted after it is input into the model, the first model accuracy corresponding to the first-large-round second-small-round user intent recognition model is 80%.
  • similarly, the first-large-round first-small-round test set corresponding to the first-large-round first-small-round user intent recognition model is used for model verification testing to obtain the second model accuracy of that model (because the 1st corpus data, used as a test data sample, corresponds to only one user intent recognition model, namely the first-large-round first-small-round user intent recognition model, this second model accuracy can directly be regarded as the second model average accuracy), where the second model accuracy equals the number of test data items predicted correctly in the first-large-round first-small-round test set divided by the total number of data items in that test set. For example, if the output values obtained by inputting the 1st, 2nd and 3rd corpus data into the first-large-round first-small-round user intent recognition model equal their corresponding label values, while the corresponding label values of the 4th and 5th corpus data cannot be predicted after they are input into the model, the second model accuracy corresponding to the first-large-round first-small-round user intent recognition model is 60%, that is, the second model average accuracy equals 60%.
  • after the first model average accuracy equal to 80% and the second model average accuracy equal to 60% are obtained in the above process, their difference is taken as the average accuracy difference corresponding to the 1st corpus data (in this case, the average accuracy difference equals 20%). The calculation of the average accuracy difference corresponding to any i-th corpus data can refer to the calculation process of the average accuracy difference of the 1st corpus data, and a sketch is also given below.
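  • The sketch below shows one way to compute the average accuracy difference for a single corpus item from the per-round results produced above; `predict` (returning the model's intent prediction for an item) and the `label` field are illustrative assumptions.

```python
def model_accuracy(model, test_set, predict):
    """Model accuracy = correctly predicted test items / total test items."""
    correct = sum(1 for item in test_set if predict(model, item) == item["label"])
    return correct / len(test_set)

def average_accuracy_difference(item, rounds, predict):
    as_train = [r for r in rounds if item in r["train"]]   # models trained on the item
    as_test = [r for r in rounds if item in r["test"]]     # models that held the item out
    first_avg = sum(model_accuracy(r["model"], r["test"], predict) for r in as_train) / len(as_train)
    second_avg = sum(model_accuracy(r["model"], r["test"], predict) for r in as_test) / len(as_test)
    return first_avg - second_avg   # e.g. 80% - 60% = 20% for the 1st corpus data above
```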
  • after the average accuracy difference corresponding to each corpus data is obtained, it can be used as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
  • the difference in sample recall rate corresponding to each corpus data is obtained, which can be used as one of the evaluation indicators for judging whether the corpus data is a negative contribution sample.
  • step S105 includes:
  • if the i-th corpus data is used as training sample data, the third target user intent recognition model set corresponding to the i-th corpus data as training sample data is obtained, the sample recall rates corresponding to the third target user intent recognition models in the set are averaged, and the first sample recall rate corresponding to the i-th corpus data as training sample data is obtained;
  • if the i-th corpus data is used as test sample data, the fourth target user intent recognition model set corresponding to the i-th corpus data as test sample data is obtained, the sample recall rates corresponding to the fourth target user intent recognition models in the set are averaged, and the second sample recall rate corresponding to the i-th corpus data as test sample data is obtained;
  • the difference between the first sample recall rate when the i-th corpus data is used as training sample data and the second sample recall rate when the i-th corpus data is used as test sample data is calculated to obtain the sample recall rate difference corresponding to the i-th corpus data.
  • for example, when the 1st corpus data is used as a training data sample, the recall rates corresponding to the first-large-round second-small-round, third-small-round and fourth-small-round user intent recognition models are 20%, 40% and 60% (if the intent label of the 1st corpus data is A, each of these recall rates is calculated by dividing the number of test sample data that are predicted as A and predicted correctly by the total number of test sample data corresponding to A among all test sample data of that user intent recognition model). The first sample recall rate is obtained by averaging these three recall rates, that is, the first sample recall rate is 40%.
  • when the 1st corpus data is used as a test data sample, the recall rate corresponding to the first-large-round first-small-round user intent recognition model is 20%, and this recall rate is taken as the second sample recall rate.
  • therefore, the sample recall rate difference corresponding to the 1st corpus data is 20%, and a sketch of this calculation is given below.
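  • A sketch of the sample recall rate difference follows, using the conventional per-intent recall (correctly predicted test items whose label is the item's intent, divided by all test items with that label); the exact denominator intended in the example above is somewhat ambiguous in the source, so treat this as an interpretation.

```python
def intent_recall(model, test_set, intent, predict):
    """Recall of one intent on a test set (conventional definition)."""
    relevant = [item for item in test_set if item["label"] == intent]
    if not relevant:
        return 0.0
    hits = sum(1 for item in relevant if predict(model, item) == intent)
    return hits / len(relevant)

def sample_recall_difference(item, rounds, predict):
    intent = item["label"]
    as_train = [r for r in rounds if item in r["train"]]
    as_test = [r for r in rounds if item in r["test"]]
    first = sum(intent_recall(r["model"], r["test"], intent, predict) for r in as_train) / len(as_train)
    second = sum(intent_recall(r["model"], r["test"], intent, predict) for r in as_test) / len(as_test)
    return first - second   # e.g. (20% + 40% + 60%) / 3 - 20% = 20% for the 1st corpus data above
```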
  • the difference in the prediction accuracy rate corresponding to each corpus data is obtained, which can be used as one of the evaluation indicators for judging whether the corpus data is a negative contribution sample.
  • step S106 includes:
  • if the i-th corpus data is used as training sample data, the fifth target user intent recognition model set corresponding to the i-th corpus data as training sample data is obtained, the prediction accuracies corresponding to the fifth target user intent recognition models in the set are averaged, and the first prediction average accuracy corresponding to the i-th corpus data as training sample data is obtained;
  • if the i-th corpus data is used as test sample data, the sixth target user intent recognition model set corresponding to the i-th corpus data as test sample data is obtained, the prediction accuracies corresponding to the sixth target user intent recognition models in the set are averaged, and the second prediction average accuracy corresponding to the i-th corpus data as test sample data is obtained;
  • the difference between the first prediction average accuracy when the i-th corpus data is used as training sample data and the second prediction average accuracy when the i-th corpus data is used as test sample data is calculated to obtain the prediction accuracy difference corresponding to the i-th corpus data.
  • in the specific implementation of calculating the prediction accuracy difference corresponding to the 1st corpus data, the 1st corpus data is first considered as a training data sample. The first prediction accuracies corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intent recognition models are each 100% (if the label of the 1st corpus data is A, the first prediction accuracy is calculated as follows: the prediction result of the 1st corpus data by the first-large-round first-small-round user intent recognition model is A, and the total number of test data samples of the 1st corpus data corresponding to that model is 1, so dividing the number of correct prediction results of the 1st corpus data by this total number of test data samples gives a first prediction accuracy of 100%); the second prediction accuracy is 100% (its calculation refers to that of the first prediction accuracy), and the third prediction accuracy is 100% (likewise). The first prediction average accuracy corresponding to the 1st corpus data is therefore 100%.
  • when the 1st corpus data is used as a test data sample, the fourth, fifth and sixth prediction accuracies corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intent recognition models are averaged to obtain the second prediction average accuracy.
  • the fourth prediction accuracy is calculated as follows: the prediction result of the 1st corpus data by the first-large-round second-small-round user intent recognition model is A, the prediction result of the 1st corpus data by the first-large-round third-small-round user intent recognition model is A, and the prediction result of the 1st corpus data by the first-large-round fourth-small-round user intent recognition model is A; the total number of test data samples of the 1st corpus data corresponding to the first-large-round second-small-round to fourth-small-round user intent recognition models is 3, so dividing the total number of correct prediction results of the 1st corpus data by this total number gives a fourth prediction accuracy of 100%.
  • the calculations of the fifth prediction accuracy and the sixth prediction accuracy refer to the calculation of the fourth prediction accuracy described above; the fifth prediction accuracy is 100% and the sixth prediction accuracy is 100%, so the second prediction average accuracy corresponding to the 1st corpus data is 100% (obtained by averaging the fourth, fifth and sixth prediction accuracies).
  • the prediction accuracy difference corresponding to the 1st corpus data is therefore equal to the difference between the first prediction average accuracy and the second prediction average accuracy, that is, the prediction accuracy difference corresponding to the 1st corpus data is equal to zero.
  • the calculation of the prediction accuracy difference corresponding to any other corpus data can refer to the calculation process of the prediction accuracy difference of the 1st corpus data; one possible reading of this calculation is sketched below.
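  • One plausible reading of the prediction accuracy described above is that each model predicts the item itself and the hit rate is averaged separately over the models that trained on the item and the models that held it out; the worked example in the source is ambiguous on this point, so the sketch below is an interpretation, not the definitive calculation.

```python
def prediction_accuracy_difference(item, rounds, predict):
    """Average hit rate of the models on the item itself, comparing models that
    saw it as training data with models that saw it as test data."""
    as_train = [r for r in rounds if item in r["train"]]
    as_test = [r for r in rounds if item in r["test"]]
    first = sum(predict(r["model"], item) == item["label"] for r in as_train) / len(as_train)
    second = sum(predict(r["model"], item) == item["label"] for r in as_test) / len(as_test)
    return first - second   # e.g. 100% - 100% = 0 for the 1st corpus data above
```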
  • the sample contribution triple corresponding to each corpus data is then used to judge whether each corpus data is a negative-contribution sample.
  • step S107 includes:
  • the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triple corresponding to each corpus data.
  • for example, the sample contribution triple corresponding to the 1st corpus data is [20%, 20%, 0]. Similarly, after the first large round of verification tests is completed, the sample contribution triple corresponding to any i-th corpus data in data set X can be obtained, as in the sketch below.
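  • Assembling the three differences into the sample contribution triple is then a simple concatenation, as in this sketch (reusing the hypothetical helpers defined in the earlier sketches):

```python
def sample_contribution_triple(item, rounds, predict):
    return (
        average_accuracy_difference(item, rounds, predict),
        sample_recall_difference(item, rounds, predict),
        prediction_accuracy_difference(item, rounds, predict),
    )
# For the 1st corpus data of the example this would be (0.20, 0.20, 0.0).
```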
  • S108 Determine whether there is corpus data whose corresponding sample contribution triple has an average accuracy difference, a sample recall rate difference and a prediction accuracy difference that are all negative.
  • if the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the sample contribution triple corresponding to a piece of corpus data are all negative, it means that there is a high probability that using the corpus data as training data to train the user intent recognition model will not make a beneficial contribution. At this time, deleting the corpus data from the full corpus data set can be considered, so as to improve the training data quality of the updated full corpus data set.
  • if the three differences, that is, the average accuracy difference, the sample recall rate difference and the prediction accuracy difference, in the sample contribution triple corresponding to a corpus data are all negative, the corpus data is taken as target corpus data.
  • these target corpus data form the corpus data set to be deleted, and the corpus data in the corpus data set to be deleted can be deleted from the full corpus data set to improve the data quality of the full corpus data set.
  • when the corpus data set to be deleted is deleted from the full corpus data set, the full corpus data set changes; compared with the full corpus data set initially obtained in step S101, the total number of corpus data in the full corpus data set in its current state is less than or equal to the total number of corpus data in the full corpus data set initially acquired in step S101. A sketch of this filtering is given below.
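  • Steps S108 to S110 then reduce to a filter over the triples, sketched below; keying the triples by list position is an assumption made for illustration.

```python
def remove_negative_contribution(full_corpus, triples):
    """Drop every corpus item whose average accuracy difference, sample recall rate
    difference and prediction accuracy difference are all negative (S108-S110)."""
    to_delete = [item for item, triple in zip(full_corpus, triples)
                 if all(value < 0 for value in triple)]
    kept = [item for item, triple in zip(full_corpus, triples)
            if not all(value < 0 for value in triple)]
    return kept, to_delete   # `kept` is the updated full corpus data set
```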
  • This updated full corpus data set can be used as a simplified high-quality training set to continue training the user intent recognition model locally on the server to obtain a user intent recognition model with higher recognition accuracy.
  • after step S110, the method further includes the following steps.
  • since the corpus data set to be deleted has been removed from the full corpus data set, the amount of data in the full corpus data set may be reduced.
  • at this time, the current iteration number is obtained (wherein the initial value of the current iteration number is 0), and one is added to the current iteration number to update the current iteration number.
  • generally, the preset maximum number of iterations is greater than 2, so after one round of sample data screening has been executed, the step of supplementing corpus data can continue to be performed. That is, if the current iteration number does not exceed the maximum number of iterations, the preset total number of supplementary corpus data items is called, and supplementary corpus data with the same total number of items is randomly selected from the local corpus pool to form a supplementary corpus data set, so as to supplement the full corpus data set updated in step S110.
  • after the full corpus data set is updated, the process returns to step S101 to proceed to the next round of data screening.
  • before the full corpus data set obtained after a round of data screening enters the next round of data sample screening, one is added to the current iteration number to update the current iteration number; it is then determined whether the current iteration number exceeds the preset maximum number of iterations (for example, if the maximum number of iterations is set to 10, 10 rounds of corpus data supplementation can be performed). If the current iteration number does not exceed the maximum number of iterations, the process returns to step S101 to perform another round of data screening; if the current iteration number exceeds the maximum number of iterations, the step of ending the process is executed. It can be seen that automatic expansion of the data samples in the data set is realized through the above-mentioned method; afterwards, the final full corpus data set can be input into the user intent recognition model to be trained for training, and the final user intent recognition model is obtained. The iteration with supplementary corpus data is sketched below.
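  • The iteration with supplementary corpus data can be pictured as the loop below; `screen_once` stands for one full round of steps S101 to S110, and `local_corpus_pool`, `supplement_total` and `max_iterations` are placeholder names for the quantities described above.

```python
import random

def iterate_with_supplement(full_corpus, local_corpus_pool, screen_once,
                            supplement_total, max_iterations):
    current_iteration = 0                       # initial value of the current iteration number
    while True:
        full_corpus = screen_once(full_corpus)  # one round of data screening (S101-S110)
        current_iteration += 1                  # update the current iteration number
        if current_iteration > max_iterations:  # exceeds the maximum: end the process
            break
        supplement = random.sample(local_corpus_pool, supplement_total)
        full_corpus = full_corpus + supplement  # add the supplementary corpus data set
    return full_corpus                          # final full corpus data set for training
```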
  • This method realizes the automatic cleaning of negative contribution corpus data, and the cleaning process does not require human intervention, which improves the efficiency of obtaining high-quality training sets.
  • the embodiment of the present application also provides a data feature enhancement device for corpus data, and the data feature enhancement device for corpus data is used to execute any embodiment of the aforementioned data feature enhancement method for corpus data.
  • FIG. 3 is a schematic block diagram of a data feature enhancement device for corpus data provided in an embodiment of the present application.
  • the data feature enhancement device 100 of the corpus data can be configured in a server.
  • the data feature enhancement device 100 for corpus data includes: a corpus data set acquisition unit 101, a data set dividing unit 102, a group training unit 103, an average accuracy difference calculation unit 104, a sample recall rate difference calculation unit 105, a prediction accuracy difference calculation unit 106, a sample contribution triple acquisition unit 107, a triple judging unit 108, a negative sample deletion unit 109 and a first data set update unit 110.
  • the corpus data set acquisition unit 101 is configured to receive a full corpus data set sent by a user terminal; wherein, the full corpus data set includes a plurality of corpus data.
  • the data set dividing unit 102 is configured to call a preset total number of groups to divide the full corpus data set into corresponding groups of corpus data subsets according to the total number of groups.
  • the group training unit 103 is configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data.
  • the group training unit 103 includes:
  • a data set labeling unit, configured to mark the full corpus data set as data set X, record the corpus data subsets into which data set X is divided as the 1st corpus data subset to the k-th corpus data subset respectively, and mark any corpus data subset between the 1st corpus data subset and the k-th corpus data subset as the j-th corpus data subset; the value of k is equal to the total number of groups, and the value of j is a positive integer in the interval [1, k];
  • a first-large-round first deletion unit, configured to delete the 1st corpus data subset from the full corpus data set, and train the user intent recognition model to be trained with the remaining corpus data subsets in the full corpus data set as a training set, to obtain the first-large-round first-small-round user intent recognition model; and
  • a first-large-round sequential deletion unit, configured to sequentially delete the 2nd corpus data subset to the k-th corpus data subset from the full corpus data set, and train the user intent recognition model to be trained with the corresponding remaining corpus data subsets as training sets, to obtain the first-large-round second-small-round user intent recognition model to the first-large-round k-th-small-round user intent recognition model in sequence.
  • the average accuracy difference calculation unit 104 is configured to obtain the first model average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second model average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the average accuracy difference corresponding to each corpus data.
  • the average accuracy difference calculation unit 104 includes:
  • a first judging unit, configured to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data of each user intent recognition model or as test sample data of each user intent recognition model; where the value range of i is a positive integer in [1, N], and N is equal to the total number of corpus data in the full corpus data set;
  • a first calculation unit, configured to, if the i-th corpus data is used as training sample data of the user intent recognition models, obtain the first target user intent recognition model set corresponding to the i-th corpus data as training sample data, and average the model accuracies corresponding to the first target user intent recognition models in the set to obtain the first model average accuracy corresponding to the i-th corpus data as training sample data;
  • a second calculation unit, configured to, if the i-th corpus data is used as test sample data of the user intent recognition models, obtain the second target user intent recognition model set corresponding to the i-th corpus data as test sample data, and average the model accuracies corresponding to the second target user intent recognition models in the set to obtain the second model average accuracy corresponding to the i-th corpus data as test sample data;
  • a first difference calculation unit, configured to calculate the difference between the first model average accuracy corresponding to the i-th corpus data as training sample data and the second model average accuracy corresponding to the i-th corpus data as test sample data, to obtain the average accuracy difference corresponding to the i-th corpus data.
  • the sample recall rate difference calculation unit 105 is configured to obtain the first sample recall rate corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second sample recall rate when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the sample recall rate difference corresponding to each corpus data.
  • the sample recall rate difference calculation unit 105 includes:
  • the second judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
  • a third calculation unit, configured to, if the i-th corpus data is used as training sample data of the user intent recognition models, obtain the third target user intent recognition model set corresponding to the i-th corpus data as training sample data, and average the sample recall rates corresponding to the third target user intent recognition models in the set to obtain the first sample recall rate corresponding to the i-th corpus data as training sample data;
  • a fourth calculation unit, configured to, if the i-th corpus data is used as test sample data of the user intent recognition models, obtain the fourth target user intent recognition model set corresponding to the i-th corpus data as test sample data, and average the sample recall rates corresponding to the fourth target user intent recognition models in the set to obtain the second sample recall rate corresponding to the i-th corpus data as test sample data;
  • a second difference calculation unit, configured to calculate the difference between the first sample recall rate when the i-th corpus data is used as training sample data and the second sample recall rate when the i-th corpus data is used as test sample data, to obtain the sample recall rate difference corresponding to the i-th corpus data.
  • the prediction accuracy difference calculation unit 106 is configured to obtain the first prediction average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second prediction average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the prediction accuracy difference corresponding to each corpus data.
  • the prediction accuracy difference calculation unit 106 includes:
  • the third judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
  • a fifth calculation unit, configured to, if the i-th corpus data is used as training sample data of the user intent recognition models, obtain the fifth target user intent recognition model set corresponding to the i-th corpus data as training sample data, and average the prediction accuracies corresponding to the fifth target user intent recognition models in the set to obtain the first prediction average accuracy corresponding to the i-th corpus data as training sample data;
  • a sixth calculation unit, configured to, if the i-th corpus data is used as test sample data of the user intent recognition models, obtain the sixth target user intent recognition model set corresponding to the i-th corpus data as test sample data, and average the prediction accuracies corresponding to the sixth target user intent recognition models in the set to obtain the second prediction average accuracy corresponding to the i-th corpus data as test sample data;
  • a third difference calculation unit, configured to calculate the difference between the first prediction average accuracy when the i-th corpus data is used as training sample data and the second prediction average accuracy when the i-th corpus data is used as test sample data, to obtain the prediction accuracy difference corresponding to the i-th corpus data.
  • the sample contribution triple acquisition unit 107 is configured to obtain the sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data.
  • the sample contribution triple acquisition unit 107 is further configured to:
  • the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triple corresponding to each corpus data.
  • the triple judging unit 108 is configured to judge whether there is corpus data whose corresponding sample contribution triple has an average accuracy difference, a sample recall rate difference and a prediction accuracy difference that are all negative.
  • the negative sample deletion unit 109 is configured to, if the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtain the corresponding target corpus data to form a corpus data set to be deleted.
  • the first data set update unit 110 is configured to delete the to-be-deleted corpus data set from the full corpus data set to update the full corpus data set.
  • the data feature enhancement device 100 of corpus data further includes:
  • the current iteration number update unit is used to obtain the current iteration number, and add one to the current iteration number to update the current iteration number; wherein, the initial value of the current iteration number is 0;
  • the current iteration number judging unit is used to determine whether the current iteration number exceeds a preset maximum iteration number
  • an automatic corpus acquisition unit, configured to, if the current iteration number does not exceed the preset maximum number of iterations, call the preset total number of supplementary corpus data items and randomly select supplementary corpus data with the same total number of items from the local corpus pool to form a supplementary corpus data set;
  • the automatic corpus supplement unit is used to add the supplementary corpus data set to the full corpus data set to update the full corpus data set, and return to execute the step of obtaining the full corpus data set;
  • the process ending unit is used to end the process if the current iteration number exceeds the preset maximum iteration number.
  • the device realizes automatic cleaning of negative contribution corpus data, and the cleaning process does not require human intervention, which improves the efficiency of obtaining high-quality training sets.
  • the above-mentioned data feature enhancement device for corpus data can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 4.
  • FIG. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the processor 502 can execute the data feature enhancement method of the corpus data.
  • the processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • the processor 502 can execute the data feature enhancement method of corpus data.
  • the network interface 505 is used for network communication, such as providing data information transmission.
  • the structure shown in FIG. 4 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in a memory to implement the data feature enhancement method of corpus data disclosed in the embodiment of the present application.
  • the embodiment of the computer device shown in FIG. 4 does not constitute a limitation on the specific configuration of the computer device.
  • the computer device may include more or less components than those shown in the figure. Or some parts are combined, or different parts are arranged.
  • the computer device may only include a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are the same as those of the embodiment shown in FIG. 4, and will not be repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • in another embodiment of the present application, a computer-readable storage medium is provided.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the data feature enhancement method of corpus data disclosed in the embodiments of the present application.
  • the disclosed equipment, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods, or units with the same function may be combined into one unit. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk or an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium, relating to artificial intelligence technology. The method comprises: after a full corpus data set is acquired, firstly carrying out data grouping to obtain a plurality of corpus data subsets; every time one corpus data subset is sequentially deleted, training a user intention recognition model to be trained in order to obtain a plurality of user intention recognition models; taking each piece of data in the full corpus data set as training sample data and test sample data, and correspondingly calculating a model average accuracy rate difference value, a sample recall rate difference value and a prediction accuracy rate difference value, respectively, to acquire a sample contribution degree triple corresponding to each piece of corpus data; and if there is corpus data, three difference values in a corresponding sample contribution degree triple of which are negative values, acquiring target corpus data to form a corpus data set to be deleted, and then deleting said corpus data set from the full corpus data set. By means of the method, the apparatus, the computer device and the storage medium, automatic cleaning of negative-contribution corpus data is achieved, and no human intervention is needed in the cleaning process, thereby improving the acquisition efficiency of a high-quality training set.

Description

Data feature enhancement method and apparatus for corpus data, computer device, and storage medium
This application claims priority to the Chinese patent application No. 202010777836.8, filed with the Chinese Patent Office on August 5, 2020 and entitled "Data feature enhancement method, apparatus and computer device for corpus data", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence model hosting, and in particular to a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium.
Background Art
Traditional conversational robots train deep learning models with corpus data to complete tasks such as user intention recognition, and the quality of the training corpus is the key factor affecting the effect of the model. Corpus quality is generally measured in two aspects, "quality" and "quantity": "quality" ensures the correctness of the corpus and clear boundaries between different intentions, while "quantity" ensures that the model can fully learn the distribution of the data features. The two complement each other, and neither can be dispensed with.
When sorting out the training data, developers found that, when expanding the "quantity" of the training set, adding a sample to the training set does not necessarily bring a positive impact.
At the same time, the inventor found that expanding the training corpus also consumes a large amount of manpower, that is, the required labor cost is high. This is because the current work of corpus data cleaning is almost entirely done manually, which leads to low efficiency in obtaining high-quality training sets.
Summary of the Invention
The embodiments of the present application provide a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium, aiming to solve the problems in the prior art that expanding the training corpus is done manually and requires high labor costs, and that the data cleaning during corpus expansion is also done manually, resulting in low efficiency in obtaining high-quality training sets.
In a first aspect, an embodiment of the present application provides a data feature enhancement method for corpus data, which includes:
acquiring a full corpus data set, where the full corpus data set includes a plurality of corpus data;
invoking a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
sequentially deleting one of the corpus data subsets into which the full corpus data set is divided, and inputting the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
for each corpus data in the full corpus data set, obtaining a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
obtaining a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
determining whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquiring the corresponding target corpus data to form a corpus data set to be deleted; and
deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In a second aspect, an embodiment of the present application provides a data feature enhancement apparatus for corpus data, which includes:
a corpus data set acquiring unit, configured to acquire a full corpus data set, where the full corpus data set includes a plurality of corpus data;
a data set dividing unit, configured to invoke a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
a group training unit, configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided, and input the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
an average accuracy difference calculating unit, configured to, for each corpus data in the full corpus data set, obtain a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
a sample recall rate difference calculating unit, configured to, for each corpus data in the full corpus data set, obtain a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
a prediction accuracy difference calculating unit, configured to, for each corpus data in the full corpus data set, obtain a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
a sample contribution triple acquiring unit, configured to obtain a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
a triple determining unit, configured to determine whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
a negative sample deleting unit, configured to, if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquire the corresponding target corpus data to form a corpus data set to be deleted; and
a data set first updating unit, configured to delete the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
acquiring a full corpus data set, where the full corpus data set includes a plurality of corpus data;
invoking a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
sequentially deleting one of the corpus data subsets into which the full corpus data set is divided, and inputting the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
for each corpus data in the full corpus data set, obtaining a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
obtaining a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
determining whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquiring the corresponding target corpus data to form a corpus data set to be deleted; and
deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to perform the following operations:
acquiring a full corpus data set, where the full corpus data set includes a plurality of corpus data;
invoking a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
sequentially deleting one of the corpus data subsets into which the full corpus data set is divided, and inputting the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
for each corpus data in the full corpus data set, obtaining a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
obtaining a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
determining whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquiring the corresponding target corpus data to form a corpus data set to be deleted; and
deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
The embodiments of the present application provide a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium. After the full corpus data set is obtained, data grouping is first performed to obtain a plurality of corpus data subsets; each time one corpus data subset is deleted in sequence, the user intention recognition model to be trained is trained, so as to obtain a plurality of user intention recognition models. With each piece of data in the full corpus data set serving as training sample data and as test sample data, the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference are calculated correspondingly, so as to obtain the sample contribution triple corresponding to each corpus data. If there is corpus data for which the three differences in the corresponding sample contribution triple are all negative, the corresponding target corpus data is obtained to form a corpus data set to be deleted, which is then deleted from the full corpus data set. This achieves automatic cleaning of negative-contribution corpus data without human intervention in the cleaning process, thereby improving the efficiency of obtaining a high-quality training set.
Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario of the data feature enhancement method for corpus data provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the data feature enhancement method for corpus data provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of the data feature enhancement apparatus for corpus data provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. As used in the specification of the present application and the appended claims, unless the context clearly indicates otherwise, the singular forms "a", "an" and "the" are intended to include the plural forms.
It should be further understood that the term "and/or" used in the specification and the appended claims of the present application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the data feature enhancement method for corpus data provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of the data feature enhancement method for corpus data provided by an embodiment of the present application. The data feature enhancement method for corpus data is applied to a server and is executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S101 to S110.
S101. Receive a full corpus data set sent by a user terminal, where the full corpus data set includes a plurality of corpus data.
In this embodiment, the user terminal sends a full corpus data set to the server, so that the server can screen out the high-quality sample data with a high sample contribution degree and feed it back to the user terminal; the user terminal can then use a data set consisting of high-quality sample data to train a model to be trained (for example, a convolutional neural network, a BERT model, etc.). For example, the full corpus data set is denoted as data set X. In this application, to make the subsequent technical solution easier to understand, the following description takes a data set X that includes only 20 pieces of corpus data as an example; in actual implementation, data set X includes far more than 20 pieces of corpus data. The 20 pieces of corpus data can be denoted as the i-th piece of corpus data, where i takes positive integer values in [1, 20].
S102. Invoke a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups.
In this embodiment, in order to group the full corpus data set (that is, data set X), the total number of groups pre-stored in the server needs to be obtained. For example, the total number of groups is denoted as k. In this application, to make the subsequent technical solution easier to understand, the following description takes k = 4 as an example; in actual implementation, the total number of groups is not necessarily 4 and may take other positive integer values.
Since data set X includes 20 pieces of corpus data and the total number of groups is k = 4, the 20 pieces of corpus data in the full corpus data set are divided into 4 corpus data subsets of 5 pieces each according to the total number of groups. These corpus data subsets can be denoted as the j-th corpus data subset, where j takes positive integer values in [1, 4]. To simplify the understanding of the grouping process, the subsequent processing is described with the 1st to 5th pieces of corpus data assigned to corpus data subset No. 1, the 6th to 10th pieces to corpus data subset No. 2, the 11th to 15th pieces to corpus data subset No. 3, and the 16th to 20th pieces to corpus data subset No. 4.
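The grouping in step S102 can be pictured with a short sketch. This is a minimal illustration only, assuming the full corpus data set is held as a Python list and that contiguous slicing is an acceptable way to form the k subsets (the embodiment does not prescribe a particular partitioning strategy); the function name split_into_subsets is introduced here purely for illustration.

```python
from typing import Any, List

def split_into_subsets(full_corpus: List[Any], k: int) -> List[List[Any]]:
    """Divide the full corpus data set into k corpus data subsets.

    Contiguous slicing is used here for illustration; any partition into
    k groups of (roughly) equal size would serve the same purpose.
    """
    n = len(full_corpus)
    base, extra = divmod(n, k)
    subsets, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        subsets.append(full_corpus[start:start + size])
        start += size
    return subsets

# With 20 corpus items and k = 4, this yields 4 subsets of 5 items each,
# matching the worked example (items 1-5, 6-10, 11-15, 16-20).
corpus_x = [f"corpus_{i}" for i in range(1, 21)]
subsets = split_into_subsets(corpus_x, k=4)
assert [len(s) for s in subsets] == [5, 5, 5, 5]
```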
S103. Sequentially delete one of the corpus data subsets into which the full corpus data set is divided, and input the remaining data into the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data.
In this embodiment, so that each corpus data in a single round of validation can be used multiple times for training or testing the user intention recognition model, the following approach can be adopted: sequentially delete one of the corpus data subsets into which the full corpus data set is divided, and input the remaining data into the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups. Through this cross-validation approach, calculating the contribution degrees of N samples only requires training k models, which reduces the complexity and improves the efficiency of data contribution analysis.
In an embodiment, step S103 includes:
denoting the full corpus data set as data set X, and denoting the corpus data subsets into which data set X is divided as corpus data subset No. 1 to corpus data subset No. k, where any corpus data subset from subset No. 1 to subset No. k is denoted as corpus data subset No. j; the value of k is equal to the total number of groups, and j takes positive integer values in the interval [1, k];
deleting corpus data subset No. 1 from the full corpus data set, and training the user intention recognition model to be trained with the remaining corpus data subsets in the full corpus data set as the training set, so as to obtain a first-large-round first-small-round user intention recognition model;
sequentially deleting corpus data subset No. 2 to corpus data subset No. k from the full corpus data set, and in each case training the user intention recognition model to be trained with the remaining data as the training set, so as to sequentially obtain a first-large-round second-small-round user intention recognition model to a first-large-round k-th-small-round user intention recognition model.
In this embodiment, the description continues with k = 4. For example, after corpus data subset No. 1 is deleted from the full corpus data set for the first time, the remaining corpus data subsets No. 2, No. 3 and No. 4 form the first-large-round first-small-round training set, and the deleted corpus data subset No. 1 serves as the first-large-round first-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round first-small-round training set, the first-large-round first-small-round user intention recognition model is obtained.
Then, after corpus data subset No. 2 is deleted from the full corpus data set for the second time, the remaining corpus data subsets No. 1, No. 3 and No. 4 form the first-large-round second-small-round training set, and the deleted corpus data subset No. 2 serves as the first-large-round second-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round second-small-round training set, the first-large-round second-small-round user intention recognition model is obtained.
Next, after corpus data subset No. 3 is deleted from the full corpus data set for the third time, the remaining corpus data subsets No. 1, No. 2 and No. 4 form the first-large-round third-small-round training set, and the deleted corpus data subset No. 3 serves as the first-large-round third-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round third-small-round training set, the first-large-round third-small-round user intention recognition model is obtained.
Finally, after corpus data subset No. 4 is deleted from the full corpus data set for the fourth time, the remaining corpus data subsets No. 1, No. 2 and No. 3 form the first-large-round fourth-small-round training set, and the deleted corpus data subset No. 4 serves as the first-large-round fourth-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round fourth-small-round training set, the first-large-round fourth-small-round user intention recognition model is obtained.
After the corpus data subsets are sequentially deleted from the full corpus data set as described above and the user intention recognition model to be trained is trained in each case, the same number of user intention recognition models as the total number of groups is obtained.
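The leave-one-subset-out training of step S103 can be sketched as follows. This is an illustrative sketch only: train_intent_model stands in for whatever training routine the user intention recognition model uses, and split_into_subsets refers to the helper sketched above; neither name comes from the original disclosure.

```python
from typing import Any, Callable, Dict, List, Tuple

def train_k_models(
    subsets: List[List[Any]],
    train_intent_model: Callable[[List[Any]], Any],
) -> Dict[int, Tuple[Any, List[Any]]]:
    """For each small round j, train on all other subsets and hold out subset j as the test set.

    Returns a mapping: small-round index j -> (trained model, held-out test subset).
    With k subsets this trains exactly k models, one per small round.
    """
    models = {}
    for j, held_out in enumerate(subsets, start=1):
        train_data = [item for m, subset in enumerate(subsets, start=1)
                      if m != j for item in subset]
        models[j] = (train_intent_model(train_data), held_out)
    return models
```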
S104. For each corpus data in the full corpus data set, obtain the first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two to obtain the average accuracy difference corresponding to each corpus data.
In this embodiment, starting from the 1st piece of corpus data in data set X, the sample contribution triples corresponding to the 20 pieces of corpus data in data set X are described as an example, where each sample contribution triple consists of the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference.
In an embodiment, step S104 includes:
determining whether the i-th piece of corpus data in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models; where i takes positive integer values in [1, N], and N is equal to the total number of pieces of corpus data in the full corpus data set;
if the i-th piece of corpus data serves as training sample data of the user intention recognition models, obtaining a first target user intention recognition model set corresponding to the i-th piece of corpus data serving as training sample data, calculating the model accuracy corresponding to each first target user intention recognition model in the first target user intention recognition model set, and averaging them, so as to obtain the first model average accuracy corresponding to the i-th piece of corpus data serving as training sample data;
if the i-th piece of corpus data serves as test sample data of the user intention recognition models, obtaining a second target user intention recognition model set corresponding to the i-th piece of corpus data serving as test sample data, calculating the model accuracy corresponding to each second target user intention recognition model in the second target user intention recognition model set, and averaging them, so as to obtain the second model average accuracy corresponding to the i-th piece of corpus data serving as test sample data;
calculating the difference between the first model average accuracy corresponding to the i-th piece of corpus data serving as training sample data and the second model average accuracy corresponding to the i-th piece of corpus data serving as test sample data, so as to obtain the average accuracy difference corresponding to the i-th piece of corpus data.
In this embodiment, for example, the 1st piece of corpus data serves as a test data sample in the training process of the first large round, first small round, and serves as a training data sample in the training processes of the first large round, second small round, the first large round, third small round, and the first large round, fourth small round. That is, when the 1st piece of corpus data serves as a training data sample, the user intention recognition models obtained are the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models; when the 1st piece of corpus data serves as a test data sample, the user intention recognition model obtained is the first-large-round first-small-round user intention recognition model.
At this point, the first-large-round second-small-round test set corresponding to the first-large-round second-small-round user intention recognition model is used for model validation testing to obtain the first model accuracy of the first-large-round second-small-round user intention recognition model, where the first model accuracy is equal to the number of correctly predicted test data pieces in the first-large-round second-small-round test set divided by the total number of data pieces in that test set. For example, the output value of the 6th piece of corpus data, after being input into the first-large-round second-small-round user intention recognition model, is equal to the label value of the 6th piece of corpus data, which means that the model correctly predicted the result of the 6th piece of corpus data. Similarly, the correct results can also be predicted when the 7th, 8th and 10th pieces of corpus data are input into the first-large-round second-small-round user intention recognition model, while the label value of the 9th piece of corpus data cannot be predicted after it is input into that model; in this case, the first model accuracy corresponding to the first-large-round second-small-round user intention recognition model is 80%.
With reference to the above process, the second model accuracy corresponding to the first-large-round third-small-round user intention recognition model is 60%, and the third model accuracy corresponding to the first-large-round fourth-small-round user intention recognition model is 100%. The first model average accuracy corresponding to the 1st piece of corpus data serving as a training data sample can then be calculated as (80% + 60% + 100%) / 3 = 80%.
When calculating the second model average accuracy corresponding to the 1st piece of corpus data serving as a test data sample, the first-large-round first-small-round test set corresponding to the first-large-round first-small-round user intention recognition model is used for model validation testing to obtain the second model accuracy of that model (since the 1st piece of corpus data serves as a test data sample for only one user intention recognition model, namely the first-large-round first-small-round user intention recognition model, the second model accuracy can be regarded as the second model average accuracy), where the second model accuracy is equal to the number of correctly predicted test data pieces in the first-large-round first-small-round test set divided by the total number of data pieces in that test set. For example, the output value of the 1st piece of corpus data, after being input into the first-large-round first-small-round user intention recognition model, is equal to the label value of the 1st piece of corpus data, which means that the model correctly predicted the result of the 1st piece of corpus data. Similarly, the 2nd and 3rd pieces of corpus data can also be correctly predicted after being input into that model, while the label values of the 4th and 5th pieces of corpus data cannot be predicted; in this case, the second model accuracy corresponding to the first-large-round first-small-round user intention recognition model is 60%, that is, the second model average accuracy is equal to 60%.
After the first model average accuracy of 80% and the second model average accuracy of 60% are obtained in the above process, the difference between the two, namely 20%, can be calculated as the average accuracy difference corresponding to the 1st piece of corpus data. When calculating the average accuracy difference corresponding to the i-th piece of corpus data, reference can be made to the calculation process of the average accuracy difference of the 1st piece of corpus data. The average accuracy difference obtained for each corpus data can serve as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
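A minimal sketch of the average accuracy difference of step S104 for a single corpus item, under the assumption that each trained model is paired with its held-out test subset (as in the sketch above), that every corpus item carries an "id" and a "label" field, and that predict(model, item) returns the model's predicted intention; all of these names are illustrative, not part of the original disclosure.

```python
def model_accuracy(model, test_subset, predict):
    """Fraction of test items whose predicted intention equals their label."""
    correct = sum(1 for item in test_subset if predict(model, item) == item["label"])
    return correct / len(test_subset)

def average_accuracy_difference(item_id, models, predict):
    """models: dict j -> (model, held-out test subset), as built by train_k_models."""
    as_train, as_test = [], []
    for model, test_subset in models.values():
        acc = model_accuracy(model, test_subset, predict)
        if any(it["id"] == item_id for it in test_subset):
            as_test.append(acc)   # the item was held out, i.e. served as a test sample
        else:
            as_train.append(acc)  # the item was part of this model's training set
    first_avg = sum(as_train) / len(as_train)   # first model average accuracy
    second_avg = sum(as_test) / len(as_test)    # second model average accuracy
    return first_avg - second_avg
```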
S105. For each corpus data in the full corpus data set, obtain the first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and the second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two to obtain the sample recall rate difference corresponding to each corpus data.
In this embodiment, the sample recall rate difference obtained for each corpus data can serve as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
In an embodiment, step S105 includes:
determining whether the i-th piece of corpus data in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models;
if the i-th piece of corpus data serves as training sample data of the user intention recognition models, obtaining a third target user intention recognition model set corresponding to the i-th piece of corpus data serving as training sample data, calculating the sample recall rate corresponding to each third target user intention recognition model in the third target user intention recognition model set, and averaging them, so as to obtain the first sample recall rate corresponding to the i-th piece of corpus data serving as training sample data;
if the i-th piece of corpus data serves as test sample data of the user intention recognition models, obtaining a fourth target user intention recognition model set corresponding to the i-th piece of corpus data serving as test sample data, calculating the sample recall rate corresponding to each fourth target user intention recognition model in the fourth target user intention recognition model set, and averaging them, so as to obtain the second sample recall rate corresponding to the i-th piece of corpus data serving as test sample data;
calculating the difference between the first sample recall rate corresponding to the i-th piece of corpus data serving as training sample data and the second sample recall rate corresponding to the i-th piece of corpus data serving as test sample data, so as to obtain the sample recall rate difference corresponding to the i-th piece of corpus data.
In this embodiment, for example, when calculating the sample recall rate difference corresponding to the 1st piece of corpus data, the case in which the 1st piece of corpus data serves as a training data sample is calculated first. The first model recall rate corresponding to the first-large-round second-small-round user intention recognition model is 20% (if the intention of the 1st piece of corpus data itself is A, the first model recall rate is calculated as the actual number of test sample data pieces, among all test sample data corresponding to the first-large-round second-small-round user intention recognition model, whose model prediction result is A and whose prediction is correct, divided by the total number of test sample data pieces whose model prediction result is A), the second model recall rate corresponding to the first-large-round third-small-round user intention recognition model is 40% (calculated in the same way as the first model recall rate), and the third model recall rate corresponding to the first-large-round fourth-small-round user intention recognition model is 60% (calculated in the same way as the first model recall rate). The first sample recall rate is then obtained by averaging the first, second and third model recall rates, that is, the first sample recall rate is 40%. Next, for the case in which the 1st piece of corpus data serves as a test data sample, the fourth model recall rate corresponding to the first-large-round first-small-round user intention recognition model is 20%, and this fourth model recall rate can serve as the second sample recall rate, so the sample recall rate difference corresponding to the 1st piece of corpus data is 20%. When calculating the sample recall rate difference corresponding to the i-th piece of corpus data, reference can be made to the calculation process of the sample recall rate difference of the 1st piece of corpus data.
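A sketch of the per-model rate used in this recall example, following the formulation given above (test items predicted as intention A and predicted correctly, divided by all test items predicted as intention A); the data layout and helper names are the same illustrative assumptions as in the earlier sketches.

```python
def model_recall_for_label(model, test_subset, predict, label):
    """Among test items the model predicts as `label`, the fraction predicted correctly,
    as formulated in the worked example above."""
    predicted = [item for item in test_subset if predict(model, item) == label]
    if not predicted:
        return 0.0
    correct = sum(1 for item in predicted if item["label"] == label)
    return correct / len(predicted)

def sample_recall_difference(item, models, predict):
    """First sample recall rate (item used for training) minus second sample recall rate
    (item held out as a test sample)."""
    label = item["label"]
    as_train, as_test = [], []
    for model, test_subset in models.values():
        rate = model_recall_for_label(model, test_subset, predict, label)
        held_out = any(it["id"] == item["id"] for it in test_subset)
        (as_test if held_out else as_train).append(rate)
    return sum(as_train) / len(as_train) - sum(as_test) / len(as_test)
```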
S106. For each corpus data in the full corpus data set, obtain the first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two to obtain the prediction accuracy difference corresponding to each corpus data.
In this embodiment, the prediction accuracy difference obtained for each corpus data can serve as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
In an embodiment, step S106 includes:
determining whether the i-th piece of corpus data in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models;
if the i-th piece of corpus data serves as training sample data of the user intention recognition models, obtaining a fifth target user intention recognition model set corresponding to the i-th piece of corpus data serving as training sample data, calculating the prediction accuracy corresponding to each fifth target user intention recognition model in the fifth target user intention recognition model set, and averaging them, so as to obtain the first prediction average accuracy corresponding to the i-th piece of corpus data serving as training sample data;
if the i-th piece of corpus data serves as test sample data of the user intention recognition models, obtaining a sixth target user intention recognition model set corresponding to the i-th piece of corpus data serving as test sample data, calculating the prediction accuracy corresponding to each sixth target user intention recognition model in the sixth target user intention recognition model set, and averaging them, so as to obtain the second prediction average accuracy corresponding to the i-th piece of corpus data serving as test sample data;
calculating the difference between the first prediction average accuracy corresponding to the i-th piece of corpus data serving as training sample data and the second prediction average accuracy corresponding to the i-th piece of corpus data serving as test sample data, so as to obtain the prediction accuracy difference corresponding to the i-th piece of corpus data.
In this embodiment, for example, when calculating the prediction accuracy difference corresponding to the 1st piece of corpus data, the case in which the 1st piece of corpus data serves as a training data sample is likewise calculated first. The first prediction accuracy corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models is 100% (if the result of the 1st piece of corpus data itself is A, the first prediction accuracy is calculated as follows: the prediction result of the first-large-round first-small-round user intention recognition model for the 1st piece of corpus data is A, and the total number of test data samples of the 1st piece of corpus data corresponding to that model is 1, so the number of correct prediction results for the 1st piece of corpus data is divided by the total number of times the 1st piece of corpus data serves as a test data sample, giving a first prediction accuracy of 100%), the second prediction accuracy is 100% (calculated with reference to the calculation of the first prediction accuracy), and the third prediction accuracy is 100% (calculated with reference to the calculation of the first prediction accuracy). The first prediction average accuracy is then obtained by averaging the first, second and third prediction accuracies, that is, the first prediction average accuracy is 100%.
Next, when the 1st piece of corpus data serves as a test data sample, the second prediction average accuracy is obtained by averaging the fourth, fifth and sixth prediction accuracies respectively corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models. The fourth prediction accuracy is calculated as follows: the prediction result for the 1st piece of corpus data is A in the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models, and the total number of test data samples of the 1st piece of corpus data corresponding to these models is 3, so the total number of correct prediction results for the 1st piece of corpus data is divided by the total number of times the 1st piece of corpus data serves as a training data sample, giving a fourth prediction accuracy of 100%. The fifth and sixth prediction accuracies are calculated with reference to the calculation of the fourth prediction accuracy; for example, if the fifth prediction accuracy is 100% and the sixth prediction accuracy is 100%, the second prediction average accuracy corresponding to the 1st piece of corpus data is 100% (obtained by averaging the fourth, fifth and sixth prediction accuracies). At this point, the prediction accuracy difference corresponding to the 1st piece of corpus data is equal to the difference between the first prediction average accuracy and the second prediction average accuracy, that is, the prediction accuracy difference corresponding to the 1st piece of corpus data is equal to 0. When calculating the prediction accuracy difference corresponding to the i-th piece of corpus data, reference can be made to the calculation process of the prediction accuracy difference of the 1st piece of corpus data.
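The prediction accuracy terms track how often the models predict this particular corpus item's intention correctly. The sketch below follows one straightforward reading of step S106 (per-item correctness averaged over the models for which the item was a training sample versus the models for which it was held out as a test sample); it reuses the illustrative predict helper and data layout assumed earlier and is not the only possible reading of the worked example.

```python
def prediction_accuracy_difference(item, models, predict):
    """Difference between the item's average per-item correctness over models that
    trained on it and over the model(s) that held it out as a test sample."""
    as_train, as_test = [], []
    for model, test_subset in models.values():
        correct = 1.0 if predict(model, item) == item["label"] else 0.0
        if any(it["id"] == item["id"] for it in test_subset):
            as_test.append(correct)
        else:
            as_train.append(correct)
    return sum(as_train) / len(as_train) - sum(as_test) / len(as_test)
```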
S107: Obtain, for each corpus data item, the corresponding sample contribution triple according to the average-accuracy difference, the sample-recall difference and the prediction-accuracy difference of that item.
In this embodiment, in order to judge objectively whether each corpus data item is a negative-contribution sample, the average-accuracy difference, sample-recall difference and prediction-accuracy difference corresponding to each corpus data item are first combined, so as to obtain the sample contribution triple corresponding to each corpus data item.
In an embodiment, step S107 includes:
concatenating, in order, the average-accuracy difference, the sample-recall difference and the prediction-accuracy difference corresponding to each corpus data item, to obtain the sample contribution triple corresponding to that item.
In this embodiment, having obtained for the 1st corpus data item a model average-accuracy difference of 20%, a sample-recall difference of 20% and a prediction-accuracy difference of 0, the sample contribution triple corresponding to the 1st corpus data item is [20%, 20%, 0]. Likewise, once the first major round of verification experiments has been completed, the sample contribution triple corresponding to any i-th corpus data item in data set X is known.
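As a small illustration (the variable names are assumptions, not from the specification), the triple can simply be an ordered concatenation of the three per-item differences computed in the preceding steps:

```python
def build_contribution_triples(avg_acc_diff, recall_diff, pred_acc_diff):
    """Each argument maps corpus-item index -> the corresponding difference.
    Returns index -> (average-accuracy diff, sample-recall diff, prediction-accuracy diff)."""
    return {
        i: (avg_acc_diff[i], recall_diff[i], pred_acc_diff[i])
        for i in avg_acc_diff
    }

# Worked example for item 1: 20%, 20% and 0.
triples = build_contribution_triples({1: 0.20}, {1: 0.20}, {1: 0.0})
print(triples[1])  # (0.2, 0.2, 0.0)
```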
S108: Judge whether there is any corpus data item whose sample contribution triple has an average-accuracy difference, sample-recall difference and prediction-accuracy difference that are all negative.
In this embodiment, when the average-accuracy difference, sample-recall difference and prediction-accuracy difference in the sample contribution triple of a given corpus data item are all negative, it indicates that using this item as training data for the user intention recognition model is very unlikely to make a beneficial contribution; the item can then be removed from the full corpus data set to improve the training data quality of the updated full corpus data set.
When the three differences in the sample contribution triple of a corpus data item are not all negative, the item may still make a beneficial contribution as training data for the user intention recognition model and can remain in the full corpus data set.
S109: If there are corpus data items whose sample contribution triples have an average-accuracy difference, sample-recall difference and prediction-accuracy difference that are all negative, obtain these target corpus data items to form the corpus data set to be deleted.
In this embodiment, once all target corpus data items in the full corpus data set whose three differences (the average-accuracy difference, the sample-recall difference and the prediction-accuracy difference) are all negative have been identified, these target items form the corpus data set to be deleted; every item in this set can be removed from the full corpus data set to improve its data quality.
S110: Delete the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In this embodiment, after the corpus data set to be deleted has been removed from the full corpus data set, the full corpus data set has changed: compared with the full corpus data set initially obtained in step S101, the total number of corpus data items in the current full corpus data set is less than or equal to the total number in the initially obtained set. The updated full corpus data set can be used locally on the server as a compact, high-quality training set for continuing to train the user intention recognition model, yielding a user intention recognition model with higher recognition accuracy.
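A minimal sketch of steps S108 to S110, assuming the corpus and the triples are kept in plain dictionaries keyed by item index (these structures are illustrative, not part of the specification):

```python
def prune_negative_samples(corpus, triples):
    """corpus: index -> corpus data item; triples: index -> (d1, d2, d3).
    Returns the updated corpus and the set of indices that were deleted."""
    to_delete = {
        i for i, (d1, d2, d3) in triples.items()
        if d1 < 0 and d2 < 0 and d3 < 0          # all three differences negative
    }
    updated = {i: item for i, item in corpus.items() if i not in to_delete}
    return updated, to_delete

corpus = {1: "query about premium payment", 2: "mislabeled noisy utterance"}
triples = {1: (0.20, 0.20, 0.0), 2: (-0.05, -0.10, -0.02)}
updated, removed = prune_negative_samples(corpus, triples)
print(removed)          # {2}
print(sorted(updated))  # [1]
```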
In an embodiment, step S110 is followed by:
obtaining the current iteration count and adding one to it to update the current iteration count, the initial value of the current iteration count being 0;
judging whether the current iteration count exceeds a preset maximum iteration count;
if the current iteration count does not exceed the preset maximum iteration count, invoking a preset total number of supplementary corpus data items, and randomly drawing from a local corpus pool supplementary corpus data with the same total number of items as that preset total, to form a supplementary corpus data set;
adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to the step of obtaining the full corpus data set;
if the current iteration count exceeds the preset maximum iteration count, ending the procedure.
In this embodiment, a round of sample data screening performed up to step S110 may reduce the amount of data. To ensure that the total amount of corpus data in the data set stays the same or increases, it is first judged whether another round of corpus supplementation can be carried out.
That is, the current iteration count (whose initial value is 0) is obtained and incremented by one to update it. Since the maximum iteration count is generally greater than 2, the supplementation step can proceed after one round of sample screening. If the current iteration count does not exceed the maximum iteration count, the preset total number of supplementary corpus data items is invoked, and supplementary corpus data with that total number of items is randomly drawn from the local corpus pool to form a supplementary corpus data set; this supplements and updates the full corpus data set of step S110, after which execution returns to step S101 for the next round of data screening. To decide whether the full corpus data set that has passed the next round of screening may enter a further round, the current iteration count is again incremented by one and compared with the preset maximum iteration count (for example, with a maximum of 10 iterations, 10 rounds of corpus supplementation can be performed): if the current iteration count does not exceed the maximum, execution returns to step S101 for another round of data screening; if it exceeds the maximum, the procedure ends. In this way the data samples in the data set are expanded automatically; a sketch of this outer loop follows. The final full corpus data set obtained in this manner can then be input into the user intention recognition model to be trained, yielding the final user intention recognition model.
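The outer loop just described could look roughly as follows; `run_one_round`, the corpus pool and the parameter values are assumptions for illustration only.

```python
import random

def iterate_cleaning(full_corpus, corpus_pool, run_one_round,
                     max_iterations=10, supplement_count=100):
    """full_corpus / corpus_pool: lists of corpus data items.
    run_one_round: callable performing steps S101-S110 on a corpus and
    returning the pruned corpus (its implementation is assumed here)."""
    iteration = 0                                    # initial value of the iteration count
    while True:
        full_corpus = run_one_round(full_corpus)     # one round of screening (S101-S110)
        iteration += 1                               # update the current iteration count
        if iteration > max_iterations:
            break                                    # maximum exceeded: end the procedure
        draw = min(supplement_count, len(corpus_pool))
        supplement = random.sample(corpus_pool, draw)  # randomly drawn supplementary corpus data
        full_corpus = full_corpus + supplement         # updated full corpus data set
    return full_corpus   # final set, used to train the final user intention recognition model
```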
This method achieves automatic cleaning of negative-contribution corpus data without human intervention, which improves the efficiency of obtaining a high-quality training set.
本申请实施例还提供一种语料数据的数据特征增强装置,该语料数据的数据特征增强装置用于执行前述语料数据的数据特征增强方法的任一实施例。具体地,请参阅图3,图3是本申请实施例提供的语料数据的数据特征增强装置的示意性框图。该语料数据的数据特征增强装置100可以配置于服务器中。The embodiment of the present application also provides a data feature enhancement device for corpus data, and the data feature enhancement device for corpus data is used to execute any embodiment of the aforementioned data feature enhancement method for corpus data. Specifically, please refer to FIG. 3, which is a schematic block diagram of a data feature enhancement device for corpus data provided in an embodiment of the present application. The data feature enhancement device 100 of the corpus data can be configured in a server.
如图3所示,语料数据的数据特征增强装置100包括:语料数据集获取单元101、数据集划分单元102、分组训练单元103、平均正确率差值计算单元104、样本召回率差值计算单元105、预测正确率差值计算单元106、样本贡献度三元组获取单元107、三元组判断单元108、负样本删除单元109、数据集第一更新单元110。As shown in FIG. 3, the data feature enhancement device 100 for corpus data includes: a corpus data set acquisition unit 101, a data set division unit 102, a group training unit 103, an average correct rate difference calculation unit 104, and a sample recall rate difference calculation unit 105. The prediction accuracy rate difference calculation unit 106, the sample contribution degree triplet acquisition unit 107, the triplet judgment unit 108, the negative sample deletion unit 109, and the data set first update unit 110.
语料数据集获取单元101,用于接收用户端发送的全量语料数据集;其中,所述全量语料数据集中包括多个语料数据。The corpus data set acquisition unit 101 is configured to receive a full corpus data set sent by a user terminal; wherein, the full corpus data set includes a plurality of corpus data.
数据集划分单元102,用于调用预先设置的分组总数值,以根据所述分组总数值将所述全量语料数据集划分为对应组数的语料数据子集。The data set dividing unit 102 is configured to call a preset total number of groups to divide the full corpus data set into corresponding groups of corpus data subsets according to the total number of groups.
The group training unit 103 is configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remainder into the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; in each round, after one of the corpus data subsets is deleted from the full corpus data set, the deleted subset serves as the corpus test set and each corpus data item in the deleted subset serves as test sample data.
在一实施例中,分组训练单元103包括:In an embodiment, the group training unit 103 includes:
a data set labeling unit, configured to denote the full corpus data set as data set X and denote the corpus data subsets into which data set X is divided as the No. 1 corpus data subset through the No. k corpus data subset, a corpus data subset between the No. 1 corpus data subset and the No. k corpus data subset being denoted as the No. j corpus data subset, where the value of k equals the total number of groups and the value of j is a positive integer in the interval [1, k];
a first-minor-round first deletion unit, configured to delete the No. 1 corpus data subset from the full corpus data set and train the user intention recognition model to be trained on the remaining corpus data subsets as the training set, obtaining the first-major-round first-minor-round user intention recognition model;
a first-minor-round sequential deletion unit, configured to sequentially delete the No. 2 corpus data subset through the No. k corpus data subset from the full corpus data set, each time training on the remainder as the training set of the user intention recognition model to be trained, so as to obtain in turn the first-major-round second-minor-round user intention recognition model through the first-major-round k-th-minor-round user intention recognition model.
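For illustration only, the division into k subsets and the leave-one-subset-out training performed by the group training unit might be sketched as below; `train_model` stands in for whatever training routine the user intention recognition model actually uses, and is an assumption.

```python
def split_into_subsets(corpus, k):
    """Split the corpus (a list) into k subsets, item by item in order."""
    return [corpus[i::k] for i in range(k)]

def leave_one_subset_out(corpus, k, train_model):
    """Train k models; in round j the No. j subset is left out as the test set."""
    subsets = split_into_subsets(corpus, k)
    models = []
    for j in range(k):
        test_set = subsets[j]                        # the deleted subset is the corpus test set
        train_set = [item for m, subset in enumerate(subsets) if m != j
                     for item in subset]             # remaining subsets form the training set
        models.append((train_model(train_set), test_set))
    return models                                    # as many models as the total number of groups
```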
The average accuracy difference calculation unit 104 is configured to obtain, for each corpus data item in the full corpus data set, the first model average accuracy corresponding to the item when it serves as training sample data of the user intention recognition models and the second model average accuracy corresponding to the item when it serves as test sample data of the user intention recognition models, and to take their difference to obtain the average-accuracy difference corresponding to each corpus data item.
在一实施例中,平均正确率差值计算单元104包括:In an embodiment, the average correct rate difference calculation unit 104 includes:
a first judging unit, configured to judge whether the i-th corpus data item in the full corpus data set serves as training sample data or as test sample data of the user intention recognition models, where i takes a positive integer value in [1, N] and N equals the total number of corpus data items in the full corpus data set;
a first calculation unit, configured to, if the i-th corpus data item serves as training sample data of the user intention recognition models, obtain the first set of target user intention recognition models corresponding to that case, calculate the model accuracy of each first target user intention recognition model in the set and average them, obtaining the first model average accuracy corresponding to the i-th corpus data item when it serves as training sample data;
a second calculation unit, configured to, if the i-th corpus data item serves as test sample data of the user intention recognition models, obtain the second set of target user intention recognition models corresponding to that case, calculate the model accuracy of each second target user intention recognition model in the set and average them, obtaining the corresponding second model average accuracy for the i-th corpus data item;
a first difference calculation unit, configured to subtract the second model average accuracy (the i-th item as test sample data) from the first model average accuracy (the i-th item as training sample data), obtaining the average-accuracy difference corresponding to the i-th corpus data item.
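A brief sketch of what this unit computes per corpus item, assuming each trained model carries its overall accuracy together with the index sets of the items it was trained and tested on (names and structures are illustrative, not from the specification):

```python
from statistics import mean

def average_accuracy_difference(i, model_records):
    """model_records: list of dicts such as
    {"accuracy": 0.9, "train_indices": {2, 3, 4}, "test_indices": {1}}."""
    as_train = [r["accuracy"] for r in model_records if i in r["train_indices"]]
    as_test = [r["accuracy"] for r in model_records if i in r["test_indices"]]
    first_avg = mean(as_train) if as_train else 0.0    # first model average accuracy
    second_avg = mean(as_test) if as_test else 0.0     # second model average accuracy
    return first_avg - second_avg                      # average-accuracy difference for item i
```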
The sample recall difference calculation unit 105 is configured to obtain, for each corpus data item in the full corpus data set, the first sample recall corresponding to the item when it serves as training sample data of the user intention recognition models and the second sample recall corresponding to the item when it serves as test sample data, and to take their difference to obtain the sample-recall difference corresponding to each corpus data item.
在一实施例中,样本召回率差值计算单元105包括:In an embodiment, the sample recall rate difference calculation unit 105 includes:
第二判断单元,用于判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;The second judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
第三计算单元,用于若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第三目标用户意图识别模型集合,计算第三目标用户意图识别模型集合中各第三目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一样本召回率;The third calculation unit is used to obtain the third target user intent recognition model set corresponding to the i-th corpus data as the training sample data of each user intent recognition model, and calculate the third The sample recall rate corresponding to each third target user intent recognition model in the target user intent recognition model set is averaged to obtain the corresponding first sample recall rate when the i-th corpus data is used as the training sample data;
第四计算单元,用于若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第四目标用户意图识别模型集合,计算第四目标用户意图识别模型集合中各第四目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二样本召回率;The fourth calculation unit is used to calculate the fourth target user intent recognition model set corresponding to the i-th corpus data as the test sample data of each user's intent recognition model when the i-th corpus data is used as the test sample data. The sample recall rate corresponding to each fourth target user intent recognition model in the target user intent recognition model set is averaged to obtain the second sample recall rate corresponding to the i-th corpus data as the training sample data;
第二差值计算单元,用于将第i条语料数据作为训练样本数据时对应的第一样本召回率与第i条语料数据作为测试样本数据时对应的第二样本召回率求差,得到第i条语料数据对应的样本召回率差值。The second difference calculation unit is used to calculate the difference between the first sample recall rate when the i-th corpus data is used as the training sample data and the second sample recall rate when the i-th corpus data is used as the test sample data to obtain The difference of the sample recall rate corresponding to the i-th corpus data.
The prediction accuracy difference calculation unit 106 is configured to obtain, for each corpus data item in the full corpus data set, the first average prediction accuracy corresponding to the item when it serves as training sample data of the user intention recognition models and the second average prediction accuracy corresponding to the item when it serves as test sample data, and to take their difference to obtain the prediction-accuracy difference corresponding to each corpus data item.
在一实施例中,预测正确率差值计算单元106包括:In an embodiment, the prediction accuracy difference calculation unit 106 includes:
第三判断单元,用于判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;The third judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
第五计算单元,用于若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第五目标用户意图识别模型集合,计算第五目标用户意图识别模型集合中各第五目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一预测平均正确率;The fifth calculation unit is used to obtain the fifth target user intent recognition model set corresponding to the i-th corpus data as the training sample data of each user intent recognition model, and calculate the fifth The prediction accuracy rate corresponding to each fifth target user intent recognition model in the target user intention recognition model set is averaged to obtain the first prediction average accuracy rate corresponding to the i-th corpus data as the training sample data;
第六计算单元,用于若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第六目标用户意图识别模型集合,计算第六目标用户意图识别模型集合中各第六目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二预测平均正确率;The sixth calculation unit is used to calculate the sixth target user intent recognition model set corresponding to the i-th corpus data as the test sample data of each user's intention recognition model if the ith corpus data is used as the test sample data. The prediction accuracy rate corresponding to each sixth target user intent recognition model in the target user intention recognition model set is averaged to obtain the second prediction average accuracy rate corresponding to the i-th corpus data as the training sample data;
第三差值计算单元,用于将第i条语料数据作为训练样本数据时对应的第一预测平均正确率与第i条语料数据作为测试样本数据时对应的第二预测平均正确率求差,得到第i条语料数据对应的预测正确率差值。The third difference calculation unit is used to calculate the difference between the first predicted average correct rate when the i-th corpus data is used as the training sample data and the second predicted average correct rate when the i-th corpus data is used as the test sample data, Obtain the prediction accuracy difference corresponding to the i-th corpus data.
样本贡献度三元组获取单元107,用于根据每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值,获取每一语料数据分别对应的样本贡献度三元组。The sample contribution triple acquisition unit 107 is used to obtain the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall difference and prediction accuracy difference corresponding to each corpus data group.
在一实施例中,样本贡献度三元组获取单元107还用于:In an embodiment, the sample contribution triple acquisition unit 107 is further configured to:
将每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值依序串接,得到每一语料数据对应的样本贡献度三元组。The difference in average correctness rate, sample recall rate, and prediction accuracy rate difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triples corresponding to each corpus data.
The triple judging unit 108 is configured to judge whether there is any corpus data item whose sample contribution triple has an average-accuracy difference, sample-recall difference and prediction-accuracy difference that are all negative.
The negative sample deletion unit 109 is configured to, if there are corpus data items whose sample contribution triples have all three differences negative, obtain the corresponding target corpus data items to form the corpus data set to be deleted.
数据集第一更新单元110,用于将所述待删除语料数据集从所述全量语料数据集中删除,以更新全量语料数据集。The first data set update unit 110 is configured to delete the to-be-deleted corpus data set from the full corpus data set to update the full corpus data set.
在一实施例中,语料数据的数据特征增强装置100还包括:In an embodiment, the data feature enhancement device 100 of corpus data further includes:
当前迭代次数更新单元,用于获取当前迭代次数,将所述当前迭代次数加一,以更新当前迭代次数;其中,当前迭代次数的初始值为0;The current iteration number update unit is used to obtain the current iteration number, and add one to the current iteration number to update the current iteration number; wherein, the initial value of the current iteration number is 0;
当前迭代次数判断单元,用于判断所述当前迭代次数是否超出预先设置的最大迭代次数;The current iteration number judging unit is used to determine whether the current iteration number exceeds a preset maximum iteration number;
The automatic corpus acquisition unit is configured to, if the current iteration count does not exceed the preset maximum iteration count, invoke the preset total number of supplementary corpus data items and randomly draw, from the local corpus pool, supplementary corpus data with the same total number of items, to form a supplementary corpus data set.
语料自动补充单元,用于将所述补充语料数据集增加至所述全量语料数据集中,以更新全量语料数据集,返回执行所述获取全量语料数据集的步骤;The automatic corpus supplement unit is used to add the supplementary corpus data set to the full corpus data set to update the full corpus data set, and return to execute the step of obtaining the full corpus data set;
流程结束单元,用于若所述当前迭代次数超出预先设置的最大迭代次数,结束流程。The process ending unit is used to end the process if the current iteration number exceeds the preset maximum iteration number.
该装置实现了对负贡献语料数据的自动清洗,清洗过程无需人为干预,提升了高质量训练集的获取效率。The device realizes automatic cleaning of negative contribution corpus data, and the cleaning process does not require human intervention, which improves the efficiency of obtaining high-quality training sets.
上述语料数据的数据特征增强装置可以实现为计算机程序的形式,该计算机程序可以在如图4所示的计算机设备上运行。The above-mentioned data feature enhancement device for corpus data can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 4.
请参阅图4,图4是本申请实施例提供的计算机设备的示意性框图。该计算机设备500是服务器,服务器可以是独立的服务器,也可以是多个服务器组成的服务器集群。Please refer to FIG. 4, which is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
参阅图4,该计算机设备500包括通过系统总线501连接的处理器502、存储器和网络接口505,其中,存储器可以包括非易失性存储介质503和内存储器504。Referring to FIG. 4, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
该非易失性存储介质503可存储操作系统5031和计算机程序5032。该计算机程序5032被执行时,可使得处理器502执行语料数据的数据特征增强方法。The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, the processor 502 can execute the data feature enhancement method of the corpus data.
该处理器502用于提供计算和控制能力,支撑整个计算机设备500的运行。The processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.
该内存储器504为非易失性存储介质503中的计算机程序5032的运行提供环境,该计算机程序5032被处理器502执行时,可使得处理器502执行语料数据的数据特征增强方法。The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute the data feature enhancement method of corpus data.
该网络接口505用于进行网络通信,如提供数据信息的传输等。本领域技术人员可以理解,图4中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备500的限定,具体的计算机设备500可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。The network interface 505 is used for network communication, such as providing data information transmission. Those skilled in the art can understand that the structure shown in FIG. 4 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied. The specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
其中,所述处理器502用于运行存储在存储器中的计算机程序5032,以实现本申请实施例公开的语料数据的数据特征增强方法。Wherein, the processor 502 is configured to run a computer program 5032 stored in a memory to implement the data feature enhancement method of corpus data disclosed in the embodiment of the present application.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 4 does not limit the specific configuration of the computer device; in other embodiments the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 4 and are not repeated here.
应当理解,在本申请实施例中,处理器502可以是中央处理单元(Central Processing Unit,CPU),该处理器502还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规 的处理器等。It should be understood that in this embodiment of the application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. Among them, the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
在本申请的另一实施例中提供计算机可读存储介质。所述计算机可读存储介质可以是非易失性,也可以是易失性。该计算机可读存储介质存储有计算机程序,其中计算机程序被处理器执行时实现本申请实施例公开的语料数据的数据特征增强方法。In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the data feature enhancement method of corpus data disclosed in the embodiments of the present application.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的设备、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the above-described equipment, device, and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here. A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of both, in order to clearly illustrate the hardware and software Interchangeability, in the above description, the composition and steps of each example have been generally described in accordance with the function. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
在本申请所提供的几个实施例中,应该理解到,所揭露的设备、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为逻辑功能划分,实际实现时可以有另外的划分方式,也可以将具有相同功能的单元集合成一个单元,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。In the several embodiments provided in this application, it should be understood that the disclosed equipment, device, and method may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods, or the units with the same function may be combined into one. Units, for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of this application is essentially or the part that contributes to the existing technology, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium. It includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), magnetic disk or optical disk and other media that can store program codes.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. 一种语料数据的数据特征增强方法,其中,包括:A method for enhancing data features of corpus data, which includes:
    获取全量语料数据集;其中,所述全量语料数据集中包括多个语料数据;Acquiring a full corpus data set; wherein the full corpus data set includes multiple corpus data;
    调用预先设置的分组总数值,以根据所述分组总数值将所述全量语料数据集划分为对应组数的语料数据子集;Calling a preset group total value to divide the full corpus data set into corresponding groups of corpus data subsets according to the group total value;
    依序删除所述全量语料数据集对应划分的其中一个语料数据子集后分别输入至待训练用户意图识别模型,以得到和分组总数值有相同个数的用户意图识别模型;其中,每一轮删除所述全量语料数据集对应划分的其中一个语料数据子集后,该被删除的语料数据子集作为语料测试集,该被删除的语料数据子集中每一语料数据作为测试样本数据;One of the corpus data subsets corresponding to the full corpus data set is sequentially deleted and input to the user intent recognition model to be trained to obtain the same number of user intent recognition models as the total number of groups; wherein, each round After deleting one of the corpus data subsets corresponding to the division of the full corpus data set, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as the test sample data;
    获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一模型平均正确率和第二模型平均正确率求差值,以得到每一语料数据对应的平均正确率差值;Obtain the average correct rate of the first model and the average correct rate of the second model corresponding to each corpus data in the full corpus data set as the training sample data of each user intent recognition model and the test sample data of each user intent recognition model. Difference to get the average correctness difference corresponding to each corpus data;
    获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一样本召回率和第二样本召回率求差值,以得到每一语料数据对应的样本召回率差值;Obtain the first sample recall rate and the second sample recall rate corresponding to each corpus data in the full corpus data set as the training sample data of each user intent recognition model and the test sample data as the test sample data of each user intent recognition model. Value to get the sample recall rate difference corresponding to each corpus data;
    获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一预测平均正确率和第二预测平均正确率求差值,以得到每一语料数据对应的预测正确率差值;Obtain the first prediction average accuracy rate and the second prediction average accuracy rate corresponding to each corpus data in the full corpus data set as the training sample data of each user intent recognition model and the test sample data as the test sample data of each user intent recognition model. Difference to get the difference of prediction accuracy rate corresponding to each corpus data;
    根据每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值,获取每一语料数据分别对应的样本贡献度三元组;According to the difference in average accuracy rate, sample recall rate and prediction accuracy rate difference corresponding to each corpus data, obtain the sample contribution triples corresponding to each corpus data;
    判断是否存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值均为负值;Determine whether there is a sample contribution triple corresponding to the corpus data. The difference in average accuracy, sample recall, and prediction accuracy are all negative;
    若存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值均为负值,获取对应的目标语料数据,以组成待删除语料数据集;以及If there is a sample contribution triple corresponding to the corpus data, the average accuracy difference, the sample recall rate and the prediction accuracy difference are all negative, and the corresponding target corpus data is obtained to form the corpus data set to be deleted ;as well as
    将所述待删除语料数据集从所述全量语料数据集中删除,以更新全量语料数据集。The corpus data set to be deleted is deleted from the full corpus data set to update the full corpus data set.
  2. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述将所述待删除语料数据集从所述全量语料数据集中删除,以更新全量语料数据集之后,还包括:The method for enhancing data features of corpus data according to claim 1, wherein after said deleting the to-be-deleted corpus data set from the full corpus data set to update the full corpus data set, the method further comprises:
    获取当前迭代次数,将所述当前迭代次数加一,以更新当前迭代次数;其中,当前迭代次数的初始值为0;Obtain the current iteration number, and add one to the current iteration number to update the current iteration number; wherein, the initial value of the current iteration number is 0;
    判断所述当前迭代次数是否超出预先设置的最大迭代次数;Judging whether the current number of iterations exceeds a preset maximum number of iterations;
    若所述当前迭代次数未超出预先设置的最大迭代次数,调用预先设置的补充语料数据总条数,从本地语料池中随机抽取与所述补充语料数据总条数有相同总数据条数的补充语料数据,以组成补充语料数据集;If the current number of iterations does not exceed the preset maximum number of iterations, call the preset total number of supplementary corpus data, and randomly select supplements that have the same total number of data as the total number of supplementary corpus data from the local corpus Corpus data to form a supplementary corpus data set;
    将所述补充语料数据集增加至所述全量语料数据集中,以更新全量语料数据集,返回执行所述获取全量语料数据集的步骤;Adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to execute the step of obtaining the full corpus data set;
    若所述当前迭代次数超出预先设置的最大迭代次数,结束流程。If the current number of iterations exceeds the preset maximum number of iterations, the process ends.
  3. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述依序删除所述全量语料数据集对应划分的其中一个语料数据子集后分别输入至待训练用户意图识别模型,以得到和分组总数值有相同个数的用户意图识别模型,包括:The method for enhancing data features of corpus data according to claim 1, wherein said sequentially deleting one of the corpus data subsets corresponding to the division of said full corpus data set is input to the user intent recognition model to be trained to obtain User intention recognition models that have the same number as the total number of groups, including:
    denoting the full corpus data set as data set X, and denoting the corpus data subsets into which data set X is divided as the No. 1 corpus data subset through the No. k corpus data subset, a corpus data subset between the No. 1 corpus data subset and the No. k corpus data subset being denoted as the No. j corpus data subset, where the value of k equals the total number of groups and the value of j is a positive integer in the interval [1, k];
    将第1号语料数据子集从所述全量语料数据集中删除,将所述全量语料数据集中余下的其他语料数据子集作为所述待训练用户意图识别模型的训练集进行训练,得到第一大轮第一小轮用户意图识别模型;Delete the No. 1 corpus data subset from the full corpus data set, and use the remaining corpus data subsets in the full corpus data set as the training set of the user intention recognition model to be trained for training, and get the first largest The first round of user intention recognition model;
    依序将第2号语料数据子集至第k号语料数据子集分别从全量语料数据集中删除后以作为所述待训练用户意图识别模型的训练集进行训练,依序得到第一大轮第二小轮用户意图识别模型至第一大轮第k小轮用户意图识别模型。The second corpus data subset to the kth corpus data subset are deleted from the full corpus data set in sequence, and then used as the training set of the user intention recognition model to be trained for training, and the first round of the first round is obtained in sequence. The second round of user intention recognition model to the first big round of k-th small round user intention recognition model.
  4. 根据权利要求3所述的语料数据的数据特征增强方法,其中,所述获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一模型平均正确率和第二模型平均正确率求差值,以得到每一语料数据对应的平均正确率差值,包括:The method for enhancing data features of corpus data according to claim 3, wherein each corpus data in the full corpus data set is used as training sample data for each user intent recognition model and as a test for each user intent recognition model The average correct rate of the first model and the average correct rate of the second model corresponding to the sample data are calculated to obtain the difference of the average correct rate corresponding to each corpus data, including:
    judging whether the i-th corpus data item in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models, where i takes a positive integer value in [1, N] and N equals the total number of corpus data items in the full corpus data set;
    若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第一目标用户意图识别模型集合,计算第一目标用户意图识别模型集合中各第一目标用户意图识别模型对应的模型正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一模型平均正确率;If the i-th corpus data is used as the training sample data of each user's intent recognition model, the first target user intent recognition model set corresponding to the i-th corpus data as the training sample data is obtained, and the first target user intent recognition model set is calculated The model correctness rate corresponding to each first target user intent recognition model is to be averaged, and the average correct rate of the corresponding first model when the i-th corpus data is used as the training sample data is obtained;
    若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第二目标用户意图识别模型集合,计算第二目标用户意图识别模型集合中各第二目标用户意图识别模型对应的模型正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二模型平均正确率;If the i-th corpus data is used as the test sample data of each user's intent recognition model, the second target user intent recognition model set corresponding to the i-th corpus data as the test sample data is obtained, and the second target user intent recognition model set is calculated The model correct rate corresponding to each second target user intent recognition model is averaged to obtain the average correct rate of the corresponding second model when the i-th corpus data is used as the training sample data;
    subtracting the second model average accuracy corresponding to the i-th corpus data item when it serves as test sample data from the first model average accuracy corresponding to the i-th corpus data item when it serves as training sample data, to obtain the average-accuracy difference corresponding to the i-th corpus data item.
  5. 根据权利要求4所述的语料数据的数据特征增强方法,其中,所述获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一样本召回率和第二样本召回率求差值,以得到每一语料数据对应的样本召回率差值,包括:The method for enhancing data features of corpus data according to claim 4, wherein each corpus data in the full corpus data set is used as training sample data for each user intent recognition model and as a test for each user intent recognition model The difference between the first sample recall rate and the second sample recall rate corresponding to the sample data is calculated to obtain the sample recall rate difference corresponding to each corpus data, including:
    判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;Determine whether the i-th corpus data in the full corpus data set is used as training sample data for each user's intention recognition model, or as test sample data for each user's intention recognition model;
    若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第三目标用户意图识别模型集合,计算第三目标用户意图识别模型集合中各第三目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一样本召回率;If the i-th corpus data is used as the training sample data for each user's intent recognition model, the third target user intent recognition model set corresponding to the i-th corpus data as the training sample data is obtained, and the third target user intent recognition model set is calculated The sample recall rate corresponding to each third target user intention recognition model is averaged, and the corresponding first sample recall rate when the i-th corpus data is used as the training sample data is obtained;
    若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第四目标用户意图识别模型集合,计算第四目标用户意图识别模型集合中各第四目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二样本召回率;If the i-th corpus data is used as the test sample data of each user's intent recognition model, the fourth target user intent recognition model set corresponding to the i-th corpus data as the test sample data is obtained, and the fourth target user intent recognition model set is calculated The sample recall rate corresponding to each fourth target user intention recognition model is averaged to obtain the second sample recall rate corresponding to the i-th corpus data as the training sample data;
    将第i条语料数据作为训练样本数据时对应的第一样本召回率与第i条语料数据作为测试样本数据时对应的第二样本召回率求差,得到第i条语料数据对应的样本召回率差值。When the i-th corpus data is used as the training sample data, the corresponding first sample recall rate and the i-th corpus data corresponding to the second sample recall rate when the i-th corpus data is used as the test sample data is the difference, and the sample recall corresponding to the i-th corpus data is obtained Rate difference.
  6. 根据权利要求5所述的语料数据的数据特征增强方法,其中,所述获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一预测平均正确率和第二预测平均正确率求差值,以得到每一语料数据对应的预测正确率差值,包括:The data feature enhancement method of corpus data according to claim 5, wherein each corpus data in the full corpus data set is used as training sample data for each user intent recognition model and as a test for each user intent recognition model The difference between the first prediction average accuracy rate and the second prediction average accuracy rate corresponding to the sample data respectively to obtain the prediction accuracy difference corresponding to each corpus data includes:
    判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;Determine whether the i-th corpus data in the full corpus data set is used as training sample data for each user's intention recognition model, or as test sample data for each user's intention recognition model;
    若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第五目标用户意图识别模型集合,计算第五目标用户意图识别模型集合中各第五目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为 训练样本数据时对应的第一预测平均正确率;If the i-th corpus data is used as the training sample data for each user's intention recognition model, the fifth target user intent recognition model set corresponding to the i-th corpus data as the training sample data is obtained, and the fifth target user intent recognition model set is calculated The prediction accuracy rate corresponding to each fifth target user intent recognition model is averaged, and the first prediction average accuracy rate corresponding to the i-th corpus data as the training sample data is obtained;
    若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第六目标用户意图识别模型集合,计算第六目标用户意图识别模型集合中各第六目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二预测平均正确率;If the i-th corpus data is used as the test sample data of each user's intent recognition model, the sixth target user intent recognition model set corresponding to the i-th corpus data as the test sample data is obtained, and the sixth target user intent recognition model set is calculated The prediction accuracy rate corresponding to each sixth target user intent recognition model is averaged, and the second prediction average accuracy rate corresponding to the i-th corpus data as the training sample data is obtained;
    将第i条语料数据作为训练样本数据时对应的第一预测平均正确率与第i条语料数据作为测试样本数据时对应的第二预测平均正确率求差,得到第i条语料数据对应的预测正确率差值。When the i-th corpus data is used as the training sample data, the corresponding first prediction average correct rate and the i-th corpus data corresponding to the second prediction average correct rate when the i-th corpus data is used as the test sample data are calculated to obtain the prediction corresponding to the i-th corpus data The difference in accuracy.
  7. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述根据每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值,获取每一语料数据分别对应的样本贡献度三元组,包括:The method for enhancing data features of corpus data according to claim 1, wherein the difference in average correctness rate, sample recall rate, and prediction accuracy rate difference corresponding to each corpus data is used to obtain each corpus data. The corresponding sample contribution triples include:
    将每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值依序串接,得到每一语料数据对应的样本贡献度三元组。The difference in average correctness rate, sample recall rate, and prediction accuracy rate difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triples corresponding to each corpus data.
  8. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述判断是否存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值均为负值之后,还包括:The method for enhancing data features of corpus data according to claim 1, wherein said determining whether there is a difference in average correctness rate, sample recall rate difference, and prediction correctness rate difference in the sample contribution triples corresponding to the corpus data After the values are all negative, it also includes:
    若存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值不均为负值,将该语料数据保留在全量语料数据集中。If there are corpus data corresponding to the sample contribution triples in the average accuracy rate difference, the sample recall rate difference and the prediction accuracy rate difference are not all negative values, the corpus data is retained in the full corpus data set.
  9. 一种语料数据的数据特征增强装置,其中,包括:A data feature enhancement device for corpus data, which includes:
    a corpus data set acquisition unit, configured to acquire a full corpus data set, wherein the full corpus data set comprises a plurality of pieces of corpus data;
    a data set division unit, configured to call a preset total number of groups, so as to divide the full corpus data set into corpus data subsets of a corresponding number of groups according to the total number of groups;
    a group training unit, configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining subsets to a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, wherein, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each piece of corpus data in the deleted corpus data subset serves as test sample data;
    an average accuracy difference calculation unit, configured to obtain, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and to calculate their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data;
    a sample recall difference calculation unit, configured to obtain, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and to calculate their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data;
    a prediction accuracy difference calculation unit, configured to obtain, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and to calculate their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data;
    a sample contribution triple acquisition unit, configured to obtain, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data;
    a triple judgment unit, configured to judge whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative;
    a negative sample deletion unit, configured to, if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, obtain the corresponding target corpus data to form a corpus data set to be deleted; and
    a data set first update unit, configured to delete the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
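As a purely illustrative aid, the behavior of the last few units above (triple acquisition, triple judgment, negative-sample deletion and data set update) can be sketched in Python as follows; the class and function names, the toy utterances and the hard-coded difference values are assumptions made for the sketch and are not part of the claimed apparatus.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class ContributionTriple:
        acc_diff: float        # average accuracy difference
        recall_diff: float     # sample recall difference
        pred_acc_diff: float   # prediction accuracy difference

        def all_negative(self) -> bool:
            # condition checked by the triple judgment unit
            return self.acc_diff < 0 and self.recall_diff < 0 and self.pred_acc_diff < 0

    def update_full_set(corpus: List[str], triples: Dict[str, ContributionTriple]) -> List[str]:
        # negative-sample deletion + data set update: drop samples whose triple is all negative
        return [sample for sample in corpus if not triples[sample].all_negative()]

    if __name__ == "__main__":
        corpus = ["turn on the light", "asdfgh", "book a flight"]
        triples = {
            "turn on the light": ContributionTriple(0.02, 0.01, 0.00),
            "asdfgh": ContributionTriple(-0.03, -0.02, -0.01),   # all negative -> deleted
            "book a flight": ContributionTriple(0.01, -0.01, 0.02),
        }
        print(update_full_set(corpus, triples))   # ['turn on the light', 'book a flight']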
  10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    acquiring a full corpus data set, wherein the full corpus data set comprises a plurality of pieces of corpus data;
    calling a preset total number of groups, so as to divide the full corpus data set into corpus data subsets of a corresponding number of groups according to the total number of groups;
    sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, wherein, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each piece of corpus data in the deleted corpus data subset serves as test sample data;
    obtaining, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data;
    obtaining, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data;
    judging whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative;
    if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
    deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
  11. The computer device according to claim 10, wherein, after the deleting the corpus data set to be deleted from the full corpus data set so as to update the full corpus data set, the steps further comprise:
    acquiring a current iteration count and adding one to the current iteration count to update the current iteration count, wherein the initial value of the current iteration count is 0;
    judging whether the current iteration count exceeds a preset maximum iteration count;
    if the current iteration count does not exceed the preset maximum iteration count, calling a preset total number of supplementary corpus data entries, and randomly extracting, from a local corpus pool, supplementary corpus data whose total number of entries is the same as the total number of supplementary corpus data entries, so as to form a supplementary corpus data set;
    adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to the step of acquiring the full corpus data set;
    if the current iteration count exceeds the preset maximum iteration count, ending the process.
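The outer loop this claim describes can be sketched, for illustration only, as the Python fragment below; clean_one_round() is a hypothetical stand-in for the contribution-based filtering of claim 10, and the toy filtering criterion, the pool contents and all names are assumptions of the sketch rather than part of the claimed device.

    import random
    from typing import List

    def clean_one_round(full_set: List[str]) -> List[str]:
        """Placeholder for one round of the contribution-based filtering of claim 10."""
        return [s for s in full_set if "noise" not in s]     # toy criterion, for the sketch only

    def iterate_enhancement(full_set: List[str], corpus_pool: List[str],
                            supplement_count: int, max_iterations: int) -> List[str]:
        iteration = 0                                        # initial value of the iteration count
        while True:
            full_set = clean_one_round(full_set)             # delete the to-be-deleted corpus data
            iteration += 1                                   # update the current iteration count
            if iteration > max_iterations:                   # exceeds the preset maximum -> end
                return full_set
            supplement = random.sample(corpus_pool, supplement_count)
            full_set = full_set + supplement                 # add the supplementary corpus data set

    if __name__ == "__main__":
        random.seed(0)
        pool = ["pool utterance %d" % n for n in range(20)]
        start = ["good sample a", "noise sample b", "good sample c"]
        print(iterate_enhancement(start, pool, supplement_count=2, max_iterations=3))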
  12. The computer device according to claim 10, wherein the sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, comprises:
    denoting the full corpus data set as a data set X, and denoting the corpus data subsets into which the data set X is divided as corpus data subset No. 1 to corpus data subset No. k, a corpus data subset between corpus data subset No. 1 and corpus data subset No. k being denoted as corpus data subset No. j, wherein the value of k equals the total number of groups, and j takes positive integer values in the interval [1, k];
    deleting corpus data subset No. 1 from the full corpus data set, and performing training with the remaining corpus data subsets in the full corpus data set as the training set of the user intention recognition model to be trained, so as to obtain the user intention recognition model of the first major round, first minor round;
    sequentially deleting corpus data subset No. 2 to corpus data subset No. k from the full corpus data set and, in each case, performing training with the remaining corpus data subsets as the training set of the user intention recognition model to be trained, so as to sequentially obtain the user intention recognition models of the first major round, second minor round to the first major round, k-th minor round.
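A rough Python illustration of the grouping and the "delete one subset, train on the rest" procedure described in this claim is given below; the interleaved split, the most-frequent-label stand-in for the user intention recognition model and the toy utterances are assumptions of the sketch, not part of the claimed method.

    from collections import Counter
    from typing import List, Tuple

    Sample = Tuple[str, str]  # (utterance, intention label)

    def split_into_groups(data: List[Sample], k: int) -> List[List[Sample]]:
        """Divide the full corpus data set into k subsets (group No. 1 .. group No. k)."""
        return [data[i::k] for i in range(k)]

    def train_majority_model(train: List[Sample]) -> str:
        """Stand-in 'training': remember the most frequent intention label in the training set."""
        return Counter(label for _, label in train).most_common(1)[0][0]

    def leave_one_group_out(data: List[Sample], k: int) -> List[Tuple[str, List[Sample]]]:
        groups = split_into_groups(data, k)
        models = []
        for j in range(k):                           # minor round j+1: delete group No. j+1
            train = [s for g, grp in enumerate(groups) if g != j for s in grp]
            test = groups[j]                         # the deleted subset is the corpus test set
            models.append((train_majority_model(train), test))
        return models                                # k models, one per deleted group

    if __name__ == "__main__":
        data = [("turn on the light", "device_on"), ("switch off the lamp", "device_off"),
                ("lights on please", "device_on"), ("power down the fan", "device_off"),
                ("activate the heater", "device_on"), ("shut the tv off", "device_off")]
        for model, test in leave_one_group_out(data, k=3):
            print(model, [utt for utt, _ in test])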
  13. The computer device according to claim 12, wherein the obtaining, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data, comprises:
    judging whether the i-th piece of corpus data in the full corpus data set is used as training sample data of a user intention recognition model or as test sample data of a user intention recognition model, wherein i takes positive integer values in the range [1, N], and N equals the total number of pieces of corpus data in the full corpus data set;
    if the i-th piece of corpus data is used as training sample data of user intention recognition models, acquiring the first target user intention recognition model set corresponding to the i-th piece of corpus data when used as training sample data, and averaging the model accuracies of the first target user intention recognition models in the set, so as to obtain the first model average accuracy corresponding to the i-th piece of corpus data when used as training sample data;
    if the i-th piece of corpus data is used as test sample data of user intention recognition models, acquiring the second target user intention recognition model set corresponding to the i-th piece of corpus data when used as test sample data, and averaging the model accuracies of the second target user intention recognition models in the set, so as to obtain the second model average accuracy corresponding to the i-th piece of corpus data when used as test sample data;
    calculating the difference between the first model average accuracy corresponding to the i-th piece of corpus data when used as training sample data and the second model average accuracy corresponding to the i-th piece of corpus data when used as test sample data, so as to obtain the average accuracy difference corresponding to the i-th piece of corpus data.
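Assuming the per-model accuracies have already been measured, the difference this claim computes for each sample can be sketched as below; claims 14 and 15 follow the same pattern with sample recall and prediction accuracy substituted for model accuracy. The dictionaries, the example figures and the function name are assumptions made for the sketch.

    from statistics import mean
    from typing import Dict, List

    def average_accuracy_difference(
        acc_as_training: Dict[int, List[float]],  # sample i -> accuracies of models that trained on it
        acc_as_testing: Dict[int, List[float]],   # sample i -> accuracies of models that tested on it
    ) -> Dict[int, float]:
        diffs = {}
        for i in acc_as_training.keys() & acc_as_testing.keys():
            first = mean(acc_as_training[i])      # first model average accuracy
            second = mean(acc_as_testing[i])      # second model average accuracy
            diffs[i] = first - second             # average accuracy difference for sample i
        return diffs

    if __name__ == "__main__":
        acc_train = {1: [0.91, 0.89], 2: [0.80, 0.82]}
        acc_test = {1: [0.93], 2: [0.88]}
        print(average_accuracy_difference(acc_train, acc_test))
        # roughly {1: -0.03, 2: -0.07}, up to floating-point rounding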
  14. The computer device according to claim 13, wherein the obtaining, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data, comprises:
    judging whether the i-th piece of corpus data in the full corpus data set is used as training sample data of a user intention recognition model or as test sample data of a user intention recognition model;
    if the i-th piece of corpus data is used as training sample data of user intention recognition models, acquiring the third target user intention recognition model set corresponding to the i-th piece of corpus data when used as training sample data, and averaging the sample recalls of the third target user intention recognition models in the set, so as to obtain the first sample recall corresponding to the i-th piece of corpus data when used as training sample data;
    if the i-th piece of corpus data is used as test sample data of user intention recognition models, acquiring the fourth target user intention recognition model set corresponding to the i-th piece of corpus data when used as test sample data, and averaging the sample recalls of the fourth target user intention recognition models in the set, so as to obtain the second sample recall corresponding to the i-th piece of corpus data when used as test sample data;
    calculating the difference between the first sample recall corresponding to the i-th piece of corpus data when used as training sample data and the second sample recall corresponding to the i-th piece of corpus data when used as test sample data, so as to obtain the sample recall difference corresponding to the i-th piece of corpus data.
  15. The computer device according to claim 14, wherein the obtaining, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data, comprises:
    judging whether the i-th piece of corpus data in the full corpus data set is used as training sample data of a user intention recognition model or as test sample data of a user intention recognition model;
    if the i-th piece of corpus data is used as training sample data of user intention recognition models, acquiring the fifth target user intention recognition model set corresponding to the i-th piece of corpus data when used as training sample data, and averaging the prediction accuracies of the fifth target user intention recognition models in the set, so as to obtain the first prediction average accuracy corresponding to the i-th piece of corpus data when used as training sample data;
    if the i-th piece of corpus data is used as test sample data of user intention recognition models, acquiring the sixth target user intention recognition model set corresponding to the i-th piece of corpus data when used as test sample data, and averaging the prediction accuracies of the sixth target user intention recognition models in the set, so as to obtain the second prediction average accuracy corresponding to the i-th piece of corpus data when used as test sample data;
    calculating the difference between the first prediction average accuracy corresponding to the i-th piece of corpus data when used as training sample data and the second prediction average accuracy corresponding to the i-th piece of corpus data when used as test sample data, so as to obtain the prediction accuracy difference corresponding to the i-th piece of corpus data.
  16. The computer device according to claim 10, wherein the obtaining, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data, comprises:
    concatenating, in order, the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, so as to obtain the sample contribution triple corresponding to each piece of corpus data.
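As an illustration only, and assuming the three per-sample difference tables of claims 13 to 15 are available as dictionaries keyed by sample index, the ordered concatenation described here amounts to the following; the function name and data layout are assumptions of the sketch.

    from typing import Dict, Tuple

    def build_contribution_triples(
        acc_diff: Dict[int, float],
        recall_diff: Dict[int, float],
        pred_acc_diff: Dict[int, float],
    ) -> Dict[int, Tuple[float, float, float]]:
        # concatenate the three differences, in order, into one triple per sample
        return {i: (acc_diff[i], recall_diff[i], pred_acc_diff[i]) for i in acc_diff}

    # example: build_contribution_triples({1: -0.03}, {1: 0.01}, {1: -0.02}) -> {1: (-0.03, 0.01, -0.02)}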
  17. The computer device according to claim 10, wherein, after the judging whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, the steps further comprise:
    if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are not all negative, retaining that corpus data in the full corpus data set.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    acquiring a full corpus data set, wherein the full corpus data set comprises a plurality of pieces of corpus data;
    calling a preset total number of groups, so as to divide the full corpus data set into corpus data subsets of a corresponding number of groups according to the total number of groups;
    sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, wherein, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each piece of corpus data in the deleted corpus data subset serves as test sample data;
    obtaining, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data;
    obtaining, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data;
    judging whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative;
    if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
    deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
  19. The computer-readable storage medium according to claim 18, wherein, after the deleting the corpus data set to be deleted from the full corpus data set so as to update the full corpus data set, the operations further comprise:
    acquiring a current iteration count and adding one to the current iteration count to update the current iteration count, wherein the initial value of the current iteration count is 0;
    judging whether the current iteration count exceeds a preset maximum iteration count;
    if the current iteration count does not exceed the preset maximum iteration count, calling a preset total number of supplementary corpus data entries, and randomly extracting, from a local corpus pool, supplementary corpus data whose total number of entries is the same as the total number of supplementary corpus data entries, so as to form a supplementary corpus data set;
    adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to the operation of acquiring the full corpus data set;
    if the current iteration count exceeds the preset maximum iteration count, ending the process.
  20. The computer-readable storage medium according to claim 18, wherein the sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, comprises:
    denoting the full corpus data set as a data set X, and denoting the corpus data subsets into which the data set X is divided as corpus data subset No. 1 to corpus data subset No. k, a corpus data subset between corpus data subset No. 1 and corpus data subset No. k being denoted as corpus data subset No. j, wherein the value of k equals the total number of groups, and j takes positive integer values in the interval [1, k];
    deleting corpus data subset No. 1 from the full corpus data set, and performing training with the remaining corpus data subsets in the full corpus data set as the training set of the user intention recognition model to be trained, so as to obtain the user intention recognition model of the first major round, first minor round;
    sequentially deleting corpus data subset No. 2 to corpus data subset No. k from the full corpus data set and, in each case, performing training with the remaining corpus data subsets as the training set of the user intention recognition model to be trained, so as to sequentially obtain the user intention recognition models of the first major round, second minor round to the first major round, k-th minor round.
PCT/CN2020/122842 2020-08-05 2020-10-22 Data feature enhancement method and apparatus for corpus data, computer device, and storage medium WO2021139317A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010777836.8A CN111914936B (en) 2020-08-05 2020-08-05 Data characteristic enhancement method and device for corpus data and computer equipment
CN202010777836.8 2020-08-05

Publications (1)

Publication Number Publication Date
WO2021139317A1 true WO2021139317A1 (en) 2021-07-15

Family

ID=73287205

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122842 WO2021139317A1 (en) 2020-08-05 2020-10-22 Data feature enhancement method and apparatus for corpus data, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111914936B (en)
WO (1) WO2021139317A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634863B (en) * 2020-12-09 2024-02-09 深圳市优必选科技股份有限公司 Training method and device of speech synthesis model, electronic equipment and medium
CN112598326A (en) * 2020-12-31 2021-04-02 五八有限公司 Model iteration method and device, electronic equipment and storage medium
CN113111977B (en) * 2021-05-20 2021-11-09 润联软件系统(深圳)有限公司 Method and device for evaluating contribution degree of training sample and related equipment
CN113806485B (en) * 2021-09-23 2023-06-23 厦门快商通科技股份有限公司 Intention recognition method and device based on small sample cold start and readable medium
CN115098771A (en) * 2022-06-09 2022-09-23 阿里巴巴(中国)有限公司 Recommendation model updating method, recommendation model training method and computing device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344395B (en) * 2018-08-30 2022-05-20 腾讯科技(深圳)有限公司 Data processing method, device, server and storage medium
CN110458207A (en) * 2019-07-24 2019-11-15 厦门快商通科技股份有限公司 A kind of corpus Intention Anticipation method, corpus labeling method and electronic equipment
CN111274797A (en) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 Intention recognition method, device and equipment for terminal and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1940915A (en) * 2005-09-29 2007-04-04 国际商业机器公司 Corpus expansion system and method
CN104951469A (en) * 2014-03-28 2015-09-30 株式会社东芝 Method and device for optimizing corpus
US20160026634A1 (en) * 2014-07-28 2016-01-28 International Business Machines Corporation Corpus Quality Analysis
CN110134799A (en) * 2019-05-29 2019-08-16 四川长虹电器股份有限公司 A kind of text corpus based on BM25 algorithm build and optimization method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117411969A (en) * 2023-12-14 2024-01-16 致讯科技(天津)有限公司 User perception evaluation method and device for non-target material
CN117411969B (en) * 2023-12-14 2024-03-12 致讯科技(天津)有限公司 User perception evaluation method and device for non-target material

Also Published As

Publication number Publication date
CN111914936A (en) 2020-11-10
CN111914936B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
WO2021139317A1 (en) Data feature enhancement method and apparatus for corpus data, computer device, and storage medium
WO2021135477A1 (en) Probabilistic graphical model-based text attribute extraction method and apparatus, computer device and storage medium
CN109376873B (en) Operation and maintenance method, operation and maintenance device, electronic equipment and computer readable storage medium
CN113536081B (en) Data center data management method and system based on artificial intelligence
WO2021027153A1 (en) Method and apparatus for constructing traffic flow data analysis model
CN106302843B (en) A kind of IP address library update method and device
CN110647447B (en) Abnormal instance detection method, device, equipment and medium for distributed system
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
WO2022246843A1 (en) Software project risk assessment method and apparatus, computer device, and storage medium
CN106909656A (en) Obtain the method and device of Text Feature Extraction model
CN109783459A (en) The method, apparatus and computer readable storage medium of data are extracted from log
CN116089870A (en) Industrial equipment fault prediction method and device based on meta-learning under small sample condition
US20220243347A1 (en) Determination method and determination apparatus for conversion efficiency of hydrogen production by wind-solar hybrid electrolysis of water
CN111310918A (en) Data processing method and device, computer equipment and storage medium
CN114580915B (en) Intelligent evaluation method and system for hair planting effect of novel microneedle technology
CN112257215B (en) Maximum likelihood estimation solving method and system for product life distribution parameters
CN114238106A (en) Test time prediction method and device, electronic device and storage medium
CN107071553A (en) A kind of method, device and computer-readable recording medium for changing video speech
CN113628077A (en) Method for generating non-repeated examination questions, terminal and readable storage medium
CN111611279A (en) Microwave assembly fault diagnosis system and method based on test index similarity
CN109993313A (en) Sample label processing method and processing device, community partitioning method and device
CN117131784B (en) Global agent optimization-based accelerated degradation test design method and device
CN111080394B (en) Matching method, device and storage medium
CN110619047B (en) Method and device for constructing natural language model and readable storage medium
WO2021051615A1 (en) Response method and apparatus based on artificial intelligence, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911642

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911642

Country of ref document: EP

Kind code of ref document: A1