WO2021139317A1 - Data feature enhancement method and apparatus for corpus data, computer device, and storage medium - Google Patents

Data feature enhancement method and apparatus for corpus data, computer device, and storage medium Download PDF

Info

Publication number
WO2021139317A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus data
data
corpus
recognition model
sample
Application number
PCT/CN2020/122842
Other languages
French (fr)
Chinese (zh)
Inventor
林佳佳
郝正鸿
王少军
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021139317A1 publication Critical patent/WO2021139317A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of artificial intelligence model hosting, and in particular to a method, device, computer equipment and storage medium for enhancing data features of corpus data.
  • the quality of the training corpus is the key factor that determines the effect of the model.
  • the quality of the corpus is generally measured in two aspects: “quality” and “quantity”. “Quality” means ensuring the correctness of the corpus and clear boundaries between different intents, while “quantity” means ensuring that the model can fully learn the distribution of data features; the two complement each other and are both indispensable.
  • when sorting out the training data, R&D staff found that adding a sample to the training set while expanding its “quantity” does not necessarily have a positive impact.
  • at the same time, the inventor found that expanding the training corpus also consumes a lot of manpower, that is, the required labor cost is high. This is because the current work of corpus data cleaning is almost entirely done manually, which leads to low efficiency in obtaining high-quality training sets.
  • the embodiments of the present application provide a method, device, computer equipment and storage medium for enhancing data features of corpus data, aiming to solve the problems in the prior art that expanding the training corpus is done manually, which incurs a high labor cost, and that the data cleaning involved in expanding the corpus data is also done manually, which leads to low efficiency in obtaining high-quality training sets.
  • an embodiment of the present application provides a data feature enhancement method for corpus data, which includes:
  • sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • obtaining the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall rate difference and prediction accuracy difference corresponding to each corpus data;
  • if the average accuracy difference, sample recall rate difference and prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
  • deleting the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • an embodiment of the present application provides a data feature enhancement device for corpus data, which includes:
  • a corpus data set acquisition unit configured to acquire a full corpus data set; wherein, the full corpus data set includes multiple corpus data;
  • a data set dividing unit configured to call a preset total number of groups to divide the full corpus data set into corresponding groups of corpus data subsets according to the total number of groups;
  • a group training unit, configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • an average accuracy difference calculation unit, configured to obtain the first model average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second model average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the average accuracy difference corresponding to each corpus data;
  • a sample recall rate difference calculation unit, configured to obtain the first sample recall rate corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second sample recall rate when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the sample recall rate difference corresponding to each corpus data;
  • a prediction accuracy difference calculation unit, configured to obtain the first prediction average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second prediction average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the prediction accuracy difference corresponding to each corpus data;
  • a sample contribution triple acquisition unit, configured to obtain the sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
  • a triple judging unit, configured to judge whether there is corpus data whose corresponding sample contribution triple has an average accuracy difference, a sample recall rate difference and a prediction accuracy difference that are all negative;
  • a negative sample deletion unit, configured to, if the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtain the corresponding target corpus data to form a corpus data set to be deleted; and
  • a first data set update unit, configured to delete the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
  • sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • obtaining the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall rate difference and prediction accuracy difference corresponding to each corpus data;
  • if the average accuracy difference, sample recall rate difference and prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
  • deleting the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • the embodiments of the present application also provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data;
  • obtaining the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall rate difference and prediction accuracy difference corresponding to each corpus data;
  • if the average accuracy difference, sample recall rate difference and prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
  • deleting the corpus data set to be deleted from the full corpus data set to update the full corpus data set.
  • the embodiments of the application provide a method, device, computer equipment and storage medium for enhancing data features of corpus data. After the full corpus data set is obtained, the data is first grouped to obtain multiple corpus data subsets; each time one corpus data subset is deleted in sequence, the user intent recognition model to be trained is trained on the remaining data, so that multiple user intent recognition models are obtained.
  • for each corpus data in the full corpus data set, the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference between using it as training sample data and using it as test sample data are calculated to obtain the sample contribution triple corresponding to each corpus data; if the three differences in the sample contribution triple corresponding to a corpus data are all negative, the corresponding target corpus data is obtained to form a corpus data set to be deleted, which is then deleted from the full corpus data set.
  • in this way, automatic cleaning of negative-contribution corpus data is realized, and the cleaning process requires no human intervention, which improves the efficiency of obtaining a high-quality training set.
  • FIG. 1 is a schematic diagram of an application scenario of a method for enhancing data features of corpus data provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a method for enhancing data features of corpus data provided by an embodiment of this application;
  • FIG. 3 is a schematic block diagram of an apparatus for enhancing data features of corpus data provided by an embodiment of this application;
  • Fig. 4 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the method for enhancing data features of corpus data provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of the method for enhancing data features of corpus data provided by an embodiment of this application.
  • the data feature enhancement method of corpus data is applied to the server, and the method is executed by the application software installed in the server.
  • the method includes steps S101 to S110.
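  • Before the steps are described in detail, the following minimal Python sketch outlines the overall flow of steps S101 to S110. It is an illustration only: the function names (`train_intent_model`, `contribution_triple`) and the data layout are hypothetical placeholders, not part of the disclosed implementation.

```python
import math
from typing import Callable, List, Tuple

def enhance_corpus_features(
    full_corpus: List[dict],
    k: int,
    train_intent_model: Callable[[List[dict]], object],
    contribution_triple: Callable[[dict, List[object], List[List[dict]]], Tuple[float, float, float]],
) -> List[dict]:
    """Outline of S101-S110: group the corpus, train k held-out models,
    score every corpus item with a contribution triple, drop all-negative items."""
    # S102: divide the full corpus data set into k corpus data subsets
    size = math.ceil(len(full_corpus) / k)
    subsets = [full_corpus[j * size:(j + 1) * size] for j in range(k)]

    # S103: each small round deletes one subset (it becomes the corpus test set)
    # and trains a user intent recognition model on the remaining subsets
    models, test_sets = [], []
    for j in range(k):
        train_set = [x for i, s in enumerate(subsets) if i != j for x in s]
        models.append(train_intent_model(train_set))
        test_sets.append(subsets[j])

    # S104-S107: sample contribution triple = (average accuracy difference,
    # sample recall rate difference, prediction accuracy difference)
    triples = [contribution_triple(item, models, test_sets) for item in full_corpus]

    # S108-S110: delete every item whose three differences are all negative
    return [item for item, t in zip(full_corpus, triples)
            if not all(v < 0 for v in t)]
```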
  • S101 Receive a full corpus data set sent by a user terminal; wherein the full corpus data set includes multiple corpus data.
  • in specific implementation, the client sends a full corpus data set to the server so that the server can filter out high-quality sample data with a high sample contribution and feed it back to the client; the client can then use this data set of high-quality sample data to train the model to be trained (for example, a convolutional neural network, a BERT model, etc.).
  • the full corpus data set is recorded as data set X.
  • the following takes a data set X that includes only 20 pieces of corpus data as an example for illustration, although in specific implementations the number of corpus data included in data set X is far greater than 20. The above 20 pieces of corpus data can each be recorded as the i-th piece of corpus data, where the value range of i is a positive integer in [1, 20].
  • the total number of groups stored in the server needs to be obtained at this time.
  • the total number of groups is denoted as k.
  • for example, when the total number of groups is 4, the 20 corpus data in the full corpus data set are divided into 4 corpus data subsets, each containing 5 corpus data.
  • each subset can be recorded as the j-th corpus data subset, where the value range of j is a positive integer in [1, 4].
  • the following continues the explanation with the 1st to 5th corpus data divided into the 1st corpus data subset, the 6th to 10th corpus data divided into the 2nd corpus data subset, the 11th to 15th corpus data divided into the 3rd corpus data subset, and the 16th to 20th corpus data divided into the 4th corpus data subset, as illustrated in the sketch below.
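  • As a concrete illustration of this grouping (20 corpus data items, total number of groups 4), a minimal sketch follows; the item names are placeholders.

```python
import math

corpus = [f"corpus_{i}" for i in range(1, 21)]   # the 20 corpus data items of data set X
k = 4                                            # preset total number of groups

size = math.ceil(len(corpus) / k)
subsets = [corpus[j * size:(j + 1) * size] for j in range(k)]
# subsets[0] -> corpus_1 .. corpus_5    (1st corpus data subset)
# subsets[1] -> corpus_6 .. corpus_10   (2nd corpus data subset)
# subsets[2] -> corpus_11 .. corpus_15  (3rd corpus data subset)
# subsets[3] -> corpus_16 .. corpus_20  (4th corpus data subset)
```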
  • since each corpus data can be used multiple times for training or testing the user intent recognition model within a single round of verification, the process can be stated as follows: sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups.
  • step S103 includes:
  • deleting the 1st corpus data subset from the full corpus data set, and training the user intent recognition model to be trained with the remaining corpus data subsets in the full corpus data set as a training set, to obtain the first-large-round first-small-round user intent recognition model;
  • sequentially deleting the 2nd corpus data subset to the k-th corpus data subset from the full corpus data set, and training the user intent recognition model to be trained with the corresponding remaining corpus data subsets as training sets, to obtain the first-large-round second-small-round user intent recognition model to the first-large-round k-th-small-round user intent recognition model in sequence.
  • for example, when the 1st corpus data subset is deleted in the first small round, the remaining 2nd, 3rd and 4th corpus data subsets form the first-large-round first-small-round training set, and the deleted 1st corpus data subset is used as the first-large-round first-small-round test set.
  • after the user intent recognition model to be trained is trained on the first-large-round first-small-round training set, the first-large-round first-small-round user intent recognition model is obtained.
  • when the 2nd corpus data subset is deleted in the second small round, the remaining 1st, 3rd and 4th corpus data subsets form the first-large-round second-small-round training set, and the deleted 2nd corpus data subset is used as the first-large-round second-small-round test set; training on this training set yields the first-large-round second-small-round user intent recognition model.
  • when the 3rd corpus data subset is deleted in the third small round, the remaining 1st, 2nd and 4th corpus data subsets form the first-large-round third-small-round training set, and the deleted 3rd corpus data subset is used as the first-large-round third-small-round test set; training on this training set yields the first-large-round third-small-round user intent recognition model.
  • when the 4th corpus data subset is deleted in the fourth small round, the remaining 1st, 2nd and 3rd corpus data subsets form the first-large-round fourth-small-round training set, and the deleted 4th corpus data subset is used as the first-large-round fourth-small-round test set; training on this training set yields the first-large-round fourth-small-round user intent recognition model.
  • after the corpus data subsets are deleted from the full corpus data set in the above order and the user intent recognition model to be trained is trained on each resulting training set, the number of user intent recognition models obtained equals the total number of groups; a minimal sketch of this round-by-round training is given below.
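  • A minimal sketch of this round-by-round training, assuming a hypothetical `train_intent_model` callable that stands in for whatever user intent recognition model is actually used:

```python
def leave_one_subset_out_rounds(subsets, train_intent_model):
    """For each small round j, delete the j-th corpus data subset (it becomes the
    test set) and train one user intent recognition model on the remaining subsets."""
    rounds = []
    for j, test_set in enumerate(subsets):
        train_set = [item for i, s in enumerate(subsets) if i != j for item in s]
        model = train_intent_model(train_set)  # first-large-round, (j+1)-th-small-round model
        rounds.append({"model": model, "train": train_set, "test": test_set})
    return rounds
```

  • With the 4 subsets of the example above, this sketch yields 4 models, one per small round, matching the total number of groups.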
  • the sample contribution triple is composed of the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference.
  • step S104 includes:
  • if the i-th corpus data is used as training sample data, the first target user intent recognition model set corresponding to the i-th corpus data as training sample data is obtained, the model accuracies corresponding to the first target user intent recognition models in the set are averaged, and the first model average accuracy corresponding to the i-th corpus data as training sample data is obtained;
  • if the i-th corpus data is used as test sample data, the second target user intent recognition model set corresponding to the i-th corpus data as test sample data is obtained, the model accuracies corresponding to the second target user intent recognition models in the set are averaged, and the second model average accuracy corresponding to the i-th corpus data as test sample data is obtained;
  • the difference between the first model average accuracy when the i-th corpus data is used as training sample data and the second model average accuracy when the i-th corpus data is used as test sample data is calculated to obtain the average accuracy difference corresponding to the i-th corpus data.
  • for example, the 1st corpus data is used as a test data sample in the first-large-round first-small-round training, and is used as a training data sample in the first-large-round second-small-round, third-small-round and fourth-small-round training.
  • when the 1st corpus data is used as a training data sample, the user intent recognition models obtained are the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intent recognition models; when the 1st corpus data is used as a test data sample, the user intent recognition model obtained is the first-large-round first-small-round user intent recognition model.
  • the first-large-round second-small-round test set corresponding to the first-large-round second-small-round user intent recognition model is used for model verification testing to obtain the first model accuracy of that model, where the first model accuracy equals the number of test data items predicted correctly in the first-large-round second-small-round test set divided by the total number of data items in that test set. For example, if the output value obtained by inputting the 6th corpus data into the first-large-round second-small-round user intent recognition model equals the corresponding label value of the 6th corpus data, the model has correctly predicted the result of the 6th corpus data; if the 7th, 8th and 10th corpus data are likewise predicted correctly, while the corresponding label value of the 9th corpus data cannot be predicted after it is input into the model, the first model accuracy corresponding to the first-large-round second-small-round user intent recognition model is 80%.
  • similarly, the first-large-round first-small-round test set corresponding to the first-large-round first-small-round user intent recognition model is used for model verification testing to obtain the second model accuracy of that model (because the 1st corpus data, used as a test data sample, corresponds to only one user intent recognition model, namely the first-large-round first-small-round user intent recognition model, this second model accuracy can directly be regarded as the second model average accuracy), where the second model accuracy equals the number of test data items predicted correctly in the first-large-round first-small-round test set divided by the total number of data items in that test set. For example, if the output values obtained by inputting the 1st, 2nd and 3rd corpus data into the first-large-round first-small-round user intent recognition model equal their corresponding label values, while the corresponding label values of the 4th and 5th corpus data cannot be predicted after they are input into the model, the second model accuracy corresponding to the first-large-round first-small-round user intent recognition model is 60%, that is, the second model average accuracy equals 60%.
  • after the first model average accuracy equal to 80% and the second model average accuracy equal to 60% are obtained in the above process, their difference is taken as the average accuracy difference corresponding to the 1st corpus data (in this case, the average accuracy difference equals 20%). The calculation of the average accuracy difference corresponding to any i-th corpus data can refer to the calculation process of the average accuracy difference of the 1st corpus data, and a sketch is also given below.
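  • The sketch below shows one way to compute the average accuracy difference for a single corpus item from the per-round results produced above; `predict` (returning the model's intent prediction for an item) and the `label` field are illustrative assumptions.

```python
def model_accuracy(model, test_set, predict):
    """Model accuracy = correctly predicted test items / total test items."""
    correct = sum(1 for item in test_set if predict(model, item) == item["label"])
    return correct / len(test_set)

def average_accuracy_difference(item, rounds, predict):
    as_train = [r for r in rounds if item in r["train"]]   # models trained on the item
    as_test = [r for r in rounds if item in r["test"]]     # models that held the item out
    first_avg = sum(model_accuracy(r["model"], r["test"], predict) for r in as_train) / len(as_train)
    second_avg = sum(model_accuracy(r["model"], r["test"], predict) for r in as_test) / len(as_test)
    return first_avg - second_avg   # e.g. 80% - 60% = 20% for the 1st corpus data above
```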
  • after the average accuracy difference corresponding to each corpus data is obtained, it can be used as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
  • the difference in sample recall rate corresponding to each corpus data is obtained, which can be used as one of the evaluation indicators for judging whether the corpus data is a negative contribution sample.
  • step S105 includes:
  • if the i-th corpus data is used as training sample data, the third target user intent recognition model set corresponding to the i-th corpus data as training sample data is obtained, the sample recall rates corresponding to the third target user intent recognition models in the set are averaged, and the first sample recall rate corresponding to the i-th corpus data as training sample data is obtained;
  • if the i-th corpus data is used as test sample data, the fourth target user intent recognition model set corresponding to the i-th corpus data as test sample data is obtained, the sample recall rates corresponding to the fourth target user intent recognition models in the set are averaged, and the second sample recall rate corresponding to the i-th corpus data as test sample data is obtained;
  • the difference between the first sample recall rate when the i-th corpus data is used as training sample data and the second sample recall rate when the i-th corpus data is used as test sample data is calculated to obtain the sample recall rate difference corresponding to the i-th corpus data.
  • for example, when the 1st corpus data is used as a training data sample, the recall rates corresponding to the first-large-round second-small-round, third-small-round and fourth-small-round user intent recognition models are 20%, 40% and 60% (if the intent label of the 1st corpus data is A, each of these recall rates is calculated by dividing the number of test sample data that are predicted as A and predicted correctly by the total number of test sample data corresponding to A among all test sample data of that user intent recognition model). The first sample recall rate is obtained by averaging these three recall rates, that is, the first sample recall rate is 40%.
  • when the 1st corpus data is used as a test data sample, the recall rate corresponding to the first-large-round first-small-round user intent recognition model is 20%, and this recall rate is taken as the second sample recall rate.
  • therefore, the sample recall rate difference corresponding to the 1st corpus data is 20%, and a sketch of this calculation is given below.
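  • A sketch of the sample recall rate difference follows, using the conventional per-intent recall (correctly predicted test items whose label is the item's intent, divided by all test items with that label); the exact denominator intended in the example above is somewhat ambiguous in the source, so treat this as an interpretation.

```python
def intent_recall(model, test_set, intent, predict):
    """Recall of one intent on a test set (conventional definition)."""
    relevant = [item for item in test_set if item["label"] == intent]
    if not relevant:
        return 0.0
    hits = sum(1 for item in relevant if predict(model, item) == intent)
    return hits / len(relevant)

def sample_recall_difference(item, rounds, predict):
    intent = item["label"]
    as_train = [r for r in rounds if item in r["train"]]
    as_test = [r for r in rounds if item in r["test"]]
    first = sum(intent_recall(r["model"], r["test"], intent, predict) for r in as_train) / len(as_train)
    second = sum(intent_recall(r["model"], r["test"], intent, predict) for r in as_test) / len(as_test)
    return first - second   # e.g. (20% + 40% + 60%) / 3 - 20% = 20% for the 1st corpus data above
```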
  • the difference in the prediction accuracy rate corresponding to each corpus data is obtained, which can be used as one of the evaluation indicators for judging whether the corpus data is a negative contribution sample.
  • step S106 includes:
  • if the i-th corpus data is used as training sample data, the fifth target user intent recognition model set corresponding to the i-th corpus data as training sample data is obtained, the prediction accuracies corresponding to the fifth target user intent recognition models in the set are averaged, and the first prediction average accuracy corresponding to the i-th corpus data as training sample data is obtained;
  • if the i-th corpus data is used as test sample data, the sixth target user intent recognition model set corresponding to the i-th corpus data as test sample data is obtained, the prediction accuracies corresponding to the sixth target user intent recognition models in the set are averaged, and the second prediction average accuracy corresponding to the i-th corpus data as test sample data is obtained;
  • the difference between the first prediction average accuracy when the i-th corpus data is used as training sample data and the second prediction average accuracy when the i-th corpus data is used as test sample data is calculated to obtain the prediction accuracy difference corresponding to the i-th corpus data.
  • in the specific implementation of calculating the prediction accuracy difference corresponding to the 1st corpus data, the 1st corpus data is first considered as a training data sample. The first prediction accuracies corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intent recognition models are each 100% (if the label of the 1st corpus data is A, the first prediction accuracy is calculated as follows: the prediction result of the 1st corpus data by the first-large-round first-small-round user intent recognition model is A, and the total number of test data samples of the 1st corpus data corresponding to that model is 1, so dividing the number of correct prediction results of the 1st corpus data by this total number of test data samples gives a first prediction accuracy of 100%); the second prediction accuracy is 100% (its calculation refers to that of the first prediction accuracy), and the third prediction accuracy is 100% (likewise). The first prediction average accuracy corresponding to the 1st corpus data is therefore 100%.
  • when the 1st corpus data is used as a test data sample, the fourth, fifth and sixth prediction accuracies corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intent recognition models are averaged to obtain the second prediction average accuracy.
  • the fourth prediction accuracy is calculated as follows: the prediction result of the 1st corpus data by the first-large-round second-small-round user intent recognition model is A, the prediction result of the 1st corpus data by the first-large-round third-small-round user intent recognition model is A, and the prediction result of the 1st corpus data by the first-large-round fourth-small-round user intent recognition model is A; the total number of test data samples of the 1st corpus data corresponding to the first-large-round second-small-round to fourth-small-round user intent recognition models is 3, so dividing the total number of correct prediction results of the 1st corpus data by this total number gives a fourth prediction accuracy of 100%.
  • the calculations of the fifth prediction accuracy and the sixth prediction accuracy refer to the calculation of the fourth prediction accuracy described above; the fifth prediction accuracy is 100% and the sixth prediction accuracy is 100%, so the second prediction average accuracy corresponding to the 1st corpus data is 100% (obtained by averaging the fourth, fifth and sixth prediction accuracies).
  • the prediction accuracy difference corresponding to the 1st corpus data is therefore equal to the difference between the first prediction average accuracy and the second prediction average accuracy, that is, the prediction accuracy difference corresponding to the 1st corpus data is equal to zero.
  • the calculation of the prediction accuracy difference corresponding to any other corpus data can refer to the calculation process of the prediction accuracy difference of the 1st corpus data; one possible reading of this calculation is sketched below.
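  • One plausible reading of the prediction accuracy described above is that each model predicts the item itself and the hit rate is averaged separately over the models that trained on the item and the models that held it out; the worked example in the source is ambiguous on this point, so the sketch below is an interpretation, not the definitive calculation.

```python
def prediction_accuracy_difference(item, rounds, predict):
    """Average hit rate of the models on the item itself, comparing models that
    saw it as training data with models that saw it as test data."""
    as_train = [r for r in rounds if item in r["train"]]
    as_test = [r for r in rounds if item in r["test"]]
    first = sum(predict(r["model"], item) == item["label"] for r in as_train) / len(as_train)
    second = sum(predict(r["model"], item) == item["label"] for r in as_test) / len(as_test)
    return first - second   # e.g. 100% - 100% = 0 for the 1st corpus data above
```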
  • the sample contribution triple corresponding to each corpus data is then used to judge whether each corpus data is a negative-contribution sample.
  • step S107 includes:
  • the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triple corresponding to each corpus data.
  • for example, the sample contribution triple corresponding to the 1st corpus data is [20%, 20%, 0]. Similarly, after the first large round of verification tests is completed, the sample contribution triple corresponding to any i-th corpus data in data set X can be obtained, as in the sketch below.
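  • Assembling the three differences into the sample contribution triple is then a simple concatenation, as in this sketch (reusing the hypothetical helpers defined in the earlier sketches):

```python
def sample_contribution_triple(item, rounds, predict):
    return (
        average_accuracy_difference(item, rounds, predict),
        sample_recall_difference(item, rounds, predict),
        prediction_accuracy_difference(item, rounds, predict),
    )
# For the 1st corpus data of the example this would be (0.20, 0.20, 0.0).
```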
  • S108 Determine whether there is corpus data whose corresponding sample contribution triple has an average accuracy difference, a sample recall rate difference and a prediction accuracy difference that are all negative.
  • if the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the sample contribution triple corresponding to a piece of corpus data are all negative, it means that there is a high probability that using the corpus data as training data to train the user intent recognition model will not make a beneficial contribution. At this time, deleting the corpus data from the full corpus data set can be considered, so as to improve the training data quality of the updated full corpus data set.
  • if the three differences, that is, the average accuracy difference, the sample recall rate difference and the prediction accuracy difference, in the sample contribution triple corresponding to a corpus data are all negative, the corpus data is taken as target corpus data.
  • these target corpus data form the corpus data set to be deleted, and the corpus data in the corpus data set to be deleted can be deleted from the full corpus data set to improve the data quality of the full corpus data set.
  • when the corpus data set to be deleted is deleted from the full corpus data set, the full corpus data set changes; compared with the full corpus data set initially obtained in step S101, the total number of corpus data in the full corpus data set in its current state is less than or equal to the total number of corpus data in the full corpus data set initially acquired in step S101. A sketch of this filtering is given below.
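  • Steps S108 to S110 then reduce to a filter over the triples, sketched below; keying the triples by list position is an assumption made for illustration.

```python
def remove_negative_contribution(full_corpus, triples):
    """Drop every corpus item whose average accuracy difference, sample recall rate
    difference and prediction accuracy difference are all negative (S108-S110)."""
    to_delete = [item for item, triple in zip(full_corpus, triples)
                 if all(value < 0 for value in triple)]
    kept = [item for item, triple in zip(full_corpus, triples)
            if not all(value < 0 for value in triple)]
    return kept, to_delete   # `kept` is the updated full corpus data set
```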
  • This updated full corpus data set can be used as a simplified high-quality training set to continue training the user intent recognition model locally on the server to obtain a user intent recognition model with higher recognition accuracy.
  • after step S110, the method further includes the following steps.
  • since the corpus data set to be deleted has been removed from the full corpus data set, the amount of data in the full corpus data set may be reduced.
  • at this time, the current iteration number is obtained (wherein the initial value of the current iteration number is 0), and one is added to the current iteration number to update the current iteration number.
  • generally, the preset maximum number of iterations is greater than 2, so after one round of sample data screening has been executed, the step of supplementing corpus data can continue to be performed. That is, if the current iteration number does not exceed the maximum number of iterations, the preset total number of supplementary corpus data items is called, and supplementary corpus data with the same total number of items is randomly selected from the local corpus pool to form a supplementary corpus data set, so as to supplement the full corpus data set updated in step S110.
  • after the full corpus data set is updated, the process returns to step S101 to proceed to the next round of data screening.
  • before the full corpus data set obtained after a round of data screening enters the next round of data sample screening, one is added to the current iteration number to update the current iteration number; it is then determined whether the current iteration number exceeds the preset maximum number of iterations (for example, if the maximum number of iterations is set to 10, 10 rounds of corpus data supplementation can be performed). If the current iteration number does not exceed the maximum number of iterations, the process returns to step S101 to perform another round of data screening; if the current iteration number exceeds the maximum number of iterations, the step of ending the process is executed. It can be seen that automatic expansion of the data samples in the data set is realized through the above-mentioned method; afterwards, the final full corpus data set can be input into the user intent recognition model to be trained for training, and the final user intent recognition model is obtained. The iteration with supplementary corpus data is sketched below.
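  • The iteration with supplementary corpus data can be pictured as the loop below; `screen_once` stands for one full round of steps S101 to S110, and `local_corpus_pool`, `supplement_total` and `max_iterations` are placeholder names for the quantities described above.

```python
import random

def iterate_with_supplement(full_corpus, local_corpus_pool, screen_once,
                            supplement_total, max_iterations):
    current_iteration = 0                       # initial value of the current iteration number
    while True:
        full_corpus = screen_once(full_corpus)  # one round of data screening (S101-S110)
        current_iteration += 1                  # update the current iteration number
        if current_iteration > max_iterations:  # exceeds the maximum: end the process
            break
        supplement = random.sample(local_corpus_pool, supplement_total)
        full_corpus = full_corpus + supplement  # add the supplementary corpus data set
    return full_corpus                          # final full corpus data set for training
```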
  • This method realizes the automatic cleaning of negative contribution corpus data, and the cleaning process does not require human intervention, which improves the efficiency of obtaining high-quality training sets.
  • the embodiment of the present application also provides a data feature enhancement device for corpus data, and the data feature enhancement device for corpus data is used to execute any embodiment of the aforementioned data feature enhancement method for corpus data.
  • FIG. 3 is a schematic block diagram of a data feature enhancement device for corpus data provided in an embodiment of the present application.
  • the data feature enhancement device 100 of the corpus data can be configured in a server.
  • the data feature enhancement device 100 for corpus data includes: a corpus data set acquisition unit 101, a data set dividing unit 102, a group training unit 103, an average accuracy difference calculation unit 104, a sample recall rate difference calculation unit 105, a prediction accuracy difference calculation unit 106, a sample contribution triple acquisition unit 107, a triple judging unit 108, a negative sample deletion unit 109 and a first data set update unit 110.
  • the corpus data set acquisition unit 101 is configured to receive a full corpus data set sent by a user terminal; wherein, the full corpus data set includes a plurality of corpus data.
  • the data set dividing unit 102 is configured to call a preset total number of groups to divide the full corpus data set into corresponding groups of corpus data subsets according to the total number of groups.
  • the group training unit 103 is configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining data into the user intent recognition model to be trained, to obtain the same number of user intent recognition models as the total number of groups; wherein, after each round in which one of the corpus data subsets is deleted, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as test sample data.
  • the group training unit 103 includes:
  • a data set labeling unit, configured to mark the full corpus data set as data set X, record the corpus data subsets into which data set X is divided as the 1st corpus data subset to the k-th corpus data subset respectively, and mark any corpus data subset between the 1st corpus data subset and the k-th corpus data subset as the j-th corpus data subset; the value of k is equal to the total number of groups, and the value of j is a positive integer in the interval [1, k];
  • a first-large-round first deletion unit, configured to delete the 1st corpus data subset from the full corpus data set, and train the user intent recognition model to be trained with the remaining corpus data subsets in the full corpus data set as a training set, to obtain the first-large-round first-small-round user intent recognition model; and
  • a first-large-round sequential deletion unit, configured to sequentially delete the 2nd corpus data subset to the k-th corpus data subset from the full corpus data set, and train the user intent recognition model to be trained with the corresponding remaining corpus data subsets as training sets, to obtain the first-large-round second-small-round user intent recognition model to the first-large-round k-th-small-round user intent recognition model in sequence.
  • the average accuracy difference calculation unit 104 is configured to obtain the first model average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second model average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the average accuracy difference corresponding to each corpus data.
  • the average accuracy difference calculation unit 104 includes:
  • a first judging unit, configured to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data of each user intent recognition model or as test sample data of each user intent recognition model; where the value range of i is a positive integer in [1, N], and N is equal to the total number of corpus data in the full corpus data set;
  • a first calculation unit, configured to, if the i-th corpus data is used as training sample data of the user intent recognition models, obtain the first target user intent recognition model set corresponding to the i-th corpus data as training sample data, and average the model accuracies corresponding to the first target user intent recognition models in the set to obtain the first model average accuracy corresponding to the i-th corpus data as training sample data;
  • a second calculation unit, configured to, if the i-th corpus data is used as test sample data of the user intent recognition models, obtain the second target user intent recognition model set corresponding to the i-th corpus data as test sample data, and average the model accuracies corresponding to the second target user intent recognition models in the set to obtain the second model average accuracy corresponding to the i-th corpus data as test sample data;
  • a first difference calculation unit, configured to calculate the difference between the first model average accuracy corresponding to the i-th corpus data as training sample data and the second model average accuracy corresponding to the i-th corpus data as test sample data, to obtain the average accuracy difference corresponding to the i-th corpus data.
  • the sample recall rate difference calculation unit 105 is configured to obtain the first sample recall rate corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second sample recall rate when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the sample recall rate difference corresponding to each corpus data.
  • the sample recall rate difference calculation unit 105 includes:
  • the second judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
  • a third calculation unit, configured to, if the i-th corpus data is used as training sample data of the user intent recognition models, obtain the third target user intent recognition model set corresponding to the i-th corpus data as training sample data, and average the sample recall rates corresponding to the third target user intent recognition models in the set to obtain the first sample recall rate corresponding to the i-th corpus data as training sample data;
  • a fourth calculation unit, configured to, if the i-th corpus data is used as test sample data of the user intent recognition models, obtain the fourth target user intent recognition model set corresponding to the i-th corpus data as test sample data, and average the sample recall rates corresponding to the fourth target user intent recognition models in the set to obtain the second sample recall rate corresponding to the i-th corpus data as test sample data;
  • a second difference calculation unit, configured to calculate the difference between the first sample recall rate when the i-th corpus data is used as training sample data and the second sample recall rate when the i-th corpus data is used as test sample data, to obtain the sample recall rate difference corresponding to the i-th corpus data.
  • the prediction accuracy difference calculation unit 106 is configured to obtain the first prediction average accuracy corresponding to each corpus data in the full corpus data set when it is used as training sample data of the user intent recognition models and the second prediction average accuracy when it is used as test sample data of the user intent recognition models, and to calculate the difference between the two to obtain the prediction accuracy difference corresponding to each corpus data.
  • the prediction accuracy difference calculation unit 106 includes:
  • the third judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
  • a fifth calculation unit, configured to, if the i-th corpus data is used as training sample data of the user intent recognition models, obtain the fifth target user intent recognition model set corresponding to the i-th corpus data as training sample data, and average the prediction accuracies corresponding to the fifth target user intent recognition models in the set to obtain the first prediction average accuracy corresponding to the i-th corpus data as training sample data;
  • a sixth calculation unit, configured to, if the i-th corpus data is used as test sample data of the user intent recognition models, obtain the sixth target user intent recognition model set corresponding to the i-th corpus data as test sample data, and average the prediction accuracies corresponding to the sixth target user intent recognition models in the set to obtain the second prediction average accuracy corresponding to the i-th corpus data as test sample data;
  • a third difference calculation unit, configured to calculate the difference between the first prediction average accuracy when the i-th corpus data is used as training sample data and the second prediction average accuracy when the i-th corpus data is used as test sample data, to obtain the prediction accuracy difference corresponding to the i-th corpus data.
  • the sample contribution triple acquisition unit 107 is configured to obtain the sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data.
  • the sample contribution triple acquisition unit 107 is further configured to:
  • the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triple corresponding to each corpus data.
  • the triple judging unit 108 is configured to judge whether there is corpus data whose corresponding sample contribution triple has an average accuracy difference, a sample recall rate difference and a prediction accuracy difference that are all negative.
  • the negative sample deletion unit 109 is configured to, if the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the sample contribution triple corresponding to a corpus data are all negative, obtain the corresponding target corpus data to form a corpus data set to be deleted.
  • the first data set update unit 110 is configured to delete the to-be-deleted corpus data set from the full corpus data set to update the full corpus data set.
  • the data feature enhancement device 100 of corpus data further includes:
  • the current iteration number update unit is used to obtain the current iteration number, and add one to the current iteration number to update the current iteration number; wherein, the initial value of the current iteration number is 0;
  • the current iteration number judging unit is used to determine whether the current iteration number exceeds a preset maximum iteration number
  • an automatic corpus acquisition unit, configured to, if the current iteration number does not exceed the preset maximum number of iterations, call the preset total number of supplementary corpus data items and randomly select supplementary corpus data with the same total number of items from the local corpus pool to form a supplementary corpus data set;
  • the automatic corpus supplement unit is used to add the supplementary corpus data set to the full corpus data set to update the full corpus data set, and return to execute the step of obtaining the full corpus data set;
  • the process ending unit is used to end the process if the current iteration number exceeds the preset maximum iteration number.
  • the device realizes automatic cleaning of negative contribution corpus data, and the cleaning process does not require human intervention, which improves the efficiency of obtaining high-quality training sets.
  • the above-mentioned data feature enhancement device for corpus data can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 4.
  • FIG. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the processor 502 can execute the data feature enhancement method of the corpus data.
  • the processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • the processor 502 can execute the data feature enhancement method of corpus data.
  • the network interface 505 is used for network communication, such as providing data information transmission.
  • the structure shown in FIG. 4 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in a memory to implement the data feature enhancement method of corpus data disclosed in the embodiment of the present application.
  • the embodiment of the computer device shown in FIG. 4 does not constitute a limitation on the specific configuration of the computer device.
  • the computer device may include more or less components than those shown in the figure. Or some parts are combined, or different parts are arranged.
  • the computer device may only include a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are the same as those of the embodiment shown in FIG. 4, and will not be repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • in another embodiment of the present application, a computer-readable storage medium is provided.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the data feature enhancement method of corpus data disclosed in the embodiments of the present application.
  • the disclosed equipment, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods, or units with the same function may be combined into one unit. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk or an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium, relating to artificial intelligence technology. The method comprises: after a full corpus data set is acquired, firstly carrying out data grouping to obtain a plurality of corpus data subsets; every time one corpus data subset is sequentially deleted, training a user intention recognition model to be trained in order to obtain a plurality of user intention recognition models; taking each piece of data in the full corpus data set as training sample data and test sample data, and correspondingly calculating a model average accuracy rate difference value, a sample recall rate difference value and a prediction accuracy rate difference value, respectively, to acquire a sample contribution degree triple corresponding to each piece of corpus data; and if there is corpus data, three difference values in a corresponding sample contribution degree triple of which are negative values, acquiring target corpus data to form a corpus data set to be deleted, and then deleting said corpus data set from the full corpus data set. By means of the method, the apparatus, the computer device and the storage medium, automatic cleaning of negative-contribution corpus data is achieved, and no human intervention is needed in the cleaning process, thereby improving the acquisition efficiency of a high-quality training set.

Description

Data feature enhancement method and apparatus for corpus data, computer device, and storage medium
This application claims priority to the Chinese patent application No. 202010777836.8, filed with the Chinese Patent Office on August 5, 2020 and entitled "Data feature enhancement method, apparatus and computer device for corpus data", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence model hosting, and in particular to a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium.
Background Art
Traditional conversational robots train deep learning models with corpus data to complete tasks such as user intention recognition, and the quality of the training corpus is the key factor affecting the effect of the model. Corpus quality is generally measured in two aspects, "quality" and "quantity": "quality" ensures the correctness of the corpus and clear boundaries between different intentions, while "quantity" ensures that the model can fully learn the distribution of the data features. The two complement each other, and neither can be dispensed with.
When sorting out the training data, developers found that, when expanding the "quantity" of the training set, adding a sample to the training set does not necessarily bring a positive impact.
At the same time, the inventor found that expanding the training corpus also consumes a large amount of manpower, that is, the required labor cost is high. This is because the current work of corpus data cleaning is almost entirely done manually, which leads to low efficiency in obtaining high-quality training sets.
Summary of the Invention
The embodiments of the present application provide a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium, aiming to solve the problems in the prior art that expanding the training corpus is done manually and requires high labor costs, and that the data cleaning during corpus expansion is also done manually, resulting in low efficiency in obtaining high-quality training sets.
In a first aspect, an embodiment of the present application provides a data feature enhancement method for corpus data, which includes:
acquiring a full corpus data set, where the full corpus data set includes a plurality of corpus data;
invoking a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
sequentially deleting one of the corpus data subsets into which the full corpus data set is divided, and inputting the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
for each corpus data in the full corpus data set, obtaining a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
obtaining a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
determining whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquiring the corresponding target corpus data to form a corpus data set to be deleted; and
deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In a second aspect, an embodiment of the present application provides a data feature enhancement apparatus for corpus data, which includes:
a corpus data set acquiring unit, configured to acquire a full corpus data set, where the full corpus data set includes a plurality of corpus data;
a data set dividing unit, configured to invoke a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
a group training unit, configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided, and input the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
an average accuracy difference calculating unit, configured to, for each corpus data in the full corpus data set, obtain a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
a sample recall rate difference calculating unit, configured to, for each corpus data in the full corpus data set, obtain a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
a prediction accuracy difference calculating unit, configured to, for each corpus data in the full corpus data set, obtain a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
a sample contribution triple acquiring unit, configured to obtain a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
a triple determining unit, configured to determine whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
a negative sample deleting unit, configured to, if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquire the corresponding target corpus data to form a corpus data set to be deleted; and
a data set first updating unit, configured to delete the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
acquiring a full corpus data set, where the full corpus data set includes a plurality of corpus data;
invoking a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
sequentially deleting one of the corpus data subsets into which the full corpus data set is divided, and inputting the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
for each corpus data in the full corpus data set, obtaining a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
obtaining a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
determining whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquiring the corresponding target corpus data to form a corpus data set to be deleted; and
deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to perform the following operations:
acquiring a full corpus data set, where the full corpus data set includes a plurality of corpus data;
invoking a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups;
sequentially deleting one of the corpus data subsets into which the full corpus data set is divided, and inputting the remaining data into a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data;
for each corpus data in the full corpus data set, obtaining a first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain an average accuracy difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and a second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a sample recall rate difference corresponding to each corpus data;
for each corpus data in the full corpus data set, obtaining a first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and a second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculating the difference between the two, so as to obtain a prediction accuracy difference corresponding to each corpus data;
obtaining a sample contribution triple corresponding to each corpus data according to the average accuracy difference, the sample recall rate difference and the prediction accuracy difference corresponding to each corpus data;
determining whether there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative;
if there is corpus data for which the average accuracy difference, the sample recall rate difference and the prediction accuracy difference in the corresponding sample contribution triple are all negative, acquiring the corresponding target corpus data to form a corpus data set to be deleted; and
deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
The embodiments of the present application provide a data feature enhancement method and apparatus for corpus data, a computer device, and a storage medium. After the full corpus data set is obtained, data grouping is first performed to obtain a plurality of corpus data subsets; each time one corpus data subset is deleted in sequence, the user intention recognition model to be trained is trained, so as to obtain a plurality of user intention recognition models. With each piece of data in the full corpus data set serving as training sample data and as test sample data, the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference are calculated correspondingly, so as to obtain the sample contribution triple corresponding to each corpus data. If there is corpus data for which the three differences in the corresponding sample contribution triple are all negative, the corresponding target corpus data is obtained to form a corpus data set to be deleted, which is then deleted from the full corpus data set. This achieves automatic cleaning of negative-contribution corpus data without human intervention in the cleaning process, thereby improving the efficiency of obtaining a high-quality training set.
Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario of the data feature enhancement method for corpus data provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the data feature enhancement method for corpus data provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of the data feature enhancement apparatus for corpus data provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. As used in the specification of the present application and the appended claims, unless the context clearly indicates otherwise, the singular forms "a", "an" and "the" are intended to include the plural forms.
It should be further understood that the term "and/or" used in the specification and the appended claims of the present application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the data feature enhancement method for corpus data provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of the data feature enhancement method for corpus data provided by an embodiment of the present application. The data feature enhancement method for corpus data is applied to a server and is executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S101 to S110.
S101. Receive a full corpus data set sent by a user terminal, where the full corpus data set includes a plurality of corpus data.
In this embodiment, the user terminal sends a full corpus data set to the server, so that the server can screen out the high-quality sample data with a high sample contribution degree and feed it back to the user terminal; the user terminal can then use a data set consisting of high-quality sample data to train a model to be trained (for example, a convolutional neural network, a BERT model, etc.). For example, the full corpus data set is denoted as data set X. In this application, to make the subsequent technical solution easier to understand, the following description takes a data set X that includes only 20 pieces of corpus data as an example; in actual implementation, data set X includes far more than 20 pieces of corpus data. The 20 pieces of corpus data can be denoted as the i-th piece of corpus data, where i takes positive integer values in [1, 20].
S102. Invoke a preset total number of groups, so as to divide the full corpus data set into a corresponding number of corpus data subsets according to the total number of groups.
In this embodiment, in order to group the full corpus data set (that is, data set X), the total number of groups pre-stored in the server needs to be obtained. For example, the total number of groups is denoted as k. In this application, to make the subsequent technical solution easier to understand, the following description takes k = 4 as an example; in actual implementation, the total number of groups is not necessarily 4 and may take other positive integer values.
Since data set X includes 20 pieces of corpus data and the total number of groups is k = 4, the 20 pieces of corpus data in the full corpus data set are divided into 4 corpus data subsets of 5 pieces each according to the total number of groups. These corpus data subsets can be denoted as the j-th corpus data subset, where j takes positive integer values in [1, 4]. To simplify the understanding of the grouping process, the subsequent processing is described with the 1st to 5th pieces of corpus data assigned to corpus data subset No. 1, the 6th to 10th pieces to corpus data subset No. 2, the 11th to 15th pieces to corpus data subset No. 3, and the 16th to 20th pieces to corpus data subset No. 4.
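The grouping in step S102 can be pictured with a short sketch. This is a minimal illustration only, assuming the full corpus data set is held as a Python list and that contiguous slicing is an acceptable way to form the k subsets (the embodiment does not prescribe a particular partitioning strategy); the function name split_into_subsets is introduced here purely for illustration.

```python
from typing import Any, List

def split_into_subsets(full_corpus: List[Any], k: int) -> List[List[Any]]:
    """Divide the full corpus data set into k corpus data subsets.

    Contiguous slicing is used here for illustration; any partition into
    k groups of (roughly) equal size would serve the same purpose.
    """
    n = len(full_corpus)
    base, extra = divmod(n, k)
    subsets, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        subsets.append(full_corpus[start:start + size])
        start += size
    return subsets

# With 20 corpus items and k = 4, this yields 4 subsets of 5 items each,
# matching the worked example (items 1-5, 6-10, 11-15, 16-20).
corpus_x = [f"corpus_{i}" for i in range(1, 21)]
subsets = split_into_subsets(corpus_x, k=4)
assert [len(s) for s in subsets] == [5, 5, 5, 5]
```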
S103. Sequentially delete one of the corpus data subsets into which the full corpus data set is divided, and input the remaining data into the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; where, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each corpus data in the deleted corpus data subset serves as test sample data.
In this embodiment, so that each corpus data in a single round of validation can be used multiple times for training or testing the user intention recognition model, the following approach can be adopted: sequentially delete one of the corpus data subsets into which the full corpus data set is divided, and input the remaining data into the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups. Through this cross-validation approach, calculating the contribution degrees of N samples only requires training k models, which reduces the complexity and improves the efficiency of data contribution analysis.
In an embodiment, step S103 includes:
denoting the full corpus data set as data set X, and denoting the corpus data subsets into which data set X is divided as corpus data subset No. 1 to corpus data subset No. k, where any corpus data subset from subset No. 1 to subset No. k is denoted as corpus data subset No. j; the value of k is equal to the total number of groups, and j takes positive integer values in the interval [1, k];
deleting corpus data subset No. 1 from the full corpus data set, and training the user intention recognition model to be trained with the remaining corpus data subsets in the full corpus data set as the training set, so as to obtain a first-large-round first-small-round user intention recognition model;
sequentially deleting corpus data subset No. 2 to corpus data subset No. k from the full corpus data set, and in each case training the user intention recognition model to be trained with the remaining data as the training set, so as to sequentially obtain a first-large-round second-small-round user intention recognition model to a first-large-round k-th-small-round user intention recognition model.
In this embodiment, the description continues with k = 4. For example, after corpus data subset No. 1 is deleted from the full corpus data set for the first time, the remaining corpus data subsets No. 2, No. 3 and No. 4 form the first-large-round first-small-round training set, and the deleted corpus data subset No. 1 serves as the first-large-round first-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round first-small-round training set, the first-large-round first-small-round user intention recognition model is obtained.
Then, after corpus data subset No. 2 is deleted from the full corpus data set for the second time, the remaining corpus data subsets No. 1, No. 3 and No. 4 form the first-large-round second-small-round training set, and the deleted corpus data subset No. 2 serves as the first-large-round second-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round second-small-round training set, the first-large-round second-small-round user intention recognition model is obtained.
Next, after corpus data subset No. 3 is deleted from the full corpus data set for the third time, the remaining corpus data subsets No. 1, No. 2 and No. 4 form the first-large-round third-small-round training set, and the deleted corpus data subset No. 3 serves as the first-large-round third-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round third-small-round training set, the first-large-round third-small-round user intention recognition model is obtained.
Finally, after corpus data subset No. 4 is deleted from the full corpus data set for the fourth time, the remaining corpus data subsets No. 1, No. 2 and No. 3 form the first-large-round fourth-small-round training set, and the deleted corpus data subset No. 4 serves as the first-large-round fourth-small-round test set. At this point, after the user intention recognition model to be trained is trained with the first-large-round fourth-small-round training set, the first-large-round fourth-small-round user intention recognition model is obtained.
After the corpus data subsets are sequentially deleted from the full corpus data set as described above and the user intention recognition model to be trained is trained in each case, the same number of user intention recognition models as the total number of groups is obtained.
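The leave-one-subset-out training of step S103 can be sketched as follows. This is an illustrative sketch only: train_intent_model stands in for whatever training routine the user intention recognition model uses, and split_into_subsets refers to the helper sketched above; neither name comes from the original disclosure.

```python
from typing import Any, Callable, Dict, List, Tuple

def train_k_models(
    subsets: List[List[Any]],
    train_intent_model: Callable[[List[Any]], Any],
) -> Dict[int, Tuple[Any, List[Any]]]:
    """For each small round j, train on all other subsets and hold out subset j as the test set.

    Returns a mapping: small-round index j -> (trained model, held-out test subset).
    With k subsets this trains exactly k models, one per small round.
    """
    models = {}
    for j, held_out in enumerate(subsets, start=1):
        train_data = [item for m, subset in enumerate(subsets, start=1)
                      if m != j for item in subset]
        models[j] = (train_intent_model(train_data), held_out)
    return models
```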
S104. For each corpus data in the full corpus data set, obtain the first model average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two to obtain the average accuracy difference corresponding to each corpus data.
In this embodiment, starting from the 1st piece of corpus data in data set X, the sample contribution triples corresponding to the 20 pieces of corpus data in data set X are described as an example, where each sample contribution triple consists of the model average accuracy difference, the sample recall rate difference and the prediction accuracy difference.
In an embodiment, step S104 includes:
determining whether the i-th piece of corpus data in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models; where i takes positive integer values in [1, N], and N is equal to the total number of pieces of corpus data in the full corpus data set;
if the i-th piece of corpus data serves as training sample data of the user intention recognition models, obtaining a first target user intention recognition model set corresponding to the i-th piece of corpus data serving as training sample data, calculating the model accuracy corresponding to each first target user intention recognition model in the first target user intention recognition model set, and averaging them, so as to obtain the first model average accuracy corresponding to the i-th piece of corpus data serving as training sample data;
if the i-th piece of corpus data serves as test sample data of the user intention recognition models, obtaining a second target user intention recognition model set corresponding to the i-th piece of corpus data serving as test sample data, calculating the model accuracy corresponding to each second target user intention recognition model in the second target user intention recognition model set, and averaging them, so as to obtain the second model average accuracy corresponding to the i-th piece of corpus data serving as test sample data;
calculating the difference between the first model average accuracy corresponding to the i-th piece of corpus data serving as training sample data and the second model average accuracy corresponding to the i-th piece of corpus data serving as test sample data, so as to obtain the average accuracy difference corresponding to the i-th piece of corpus data.
In this embodiment, for example, the 1st piece of corpus data serves as a test data sample in the training process of the first large round, first small round, and serves as a training data sample in the training processes of the first large round, second small round, the first large round, third small round, and the first large round, fourth small round. That is, when the 1st piece of corpus data serves as a training data sample, the user intention recognition models obtained are the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models; when the 1st piece of corpus data serves as a test data sample, the user intention recognition model obtained is the first-large-round first-small-round user intention recognition model.
At this point, the first-large-round second-small-round test set corresponding to the first-large-round second-small-round user intention recognition model is used for model validation testing to obtain the first model accuracy of the first-large-round second-small-round user intention recognition model, where the first model accuracy is equal to the number of correctly predicted test data pieces in the first-large-round second-small-round test set divided by the total number of data pieces in that test set. For example, the output value of the 6th piece of corpus data, after being input into the first-large-round second-small-round user intention recognition model, is equal to the label value of the 6th piece of corpus data, which means that the model correctly predicted the result of the 6th piece of corpus data. Similarly, the correct results can also be predicted when the 7th, 8th and 10th pieces of corpus data are input into the first-large-round second-small-round user intention recognition model, while the label value of the 9th piece of corpus data cannot be predicted after it is input into that model; in this case, the first model accuracy corresponding to the first-large-round second-small-round user intention recognition model is 80%.
With reference to the above process, the second model accuracy corresponding to the first-large-round third-small-round user intention recognition model is 60%, and the third model accuracy corresponding to the first-large-round fourth-small-round user intention recognition model is 100%. The first model average accuracy corresponding to the 1st piece of corpus data serving as a training data sample can then be calculated as (80% + 60% + 100%) / 3 = 80%.
When calculating the second model average accuracy corresponding to the 1st piece of corpus data serving as a test data sample, the first-large-round first-small-round test set corresponding to the first-large-round first-small-round user intention recognition model is used for model validation testing to obtain the second model accuracy of that model (since the 1st piece of corpus data serves as a test data sample for only one user intention recognition model, namely the first-large-round first-small-round user intention recognition model, the second model accuracy can be regarded as the second model average accuracy), where the second model accuracy is equal to the number of correctly predicted test data pieces in the first-large-round first-small-round test set divided by the total number of data pieces in that test set. For example, the output value of the 1st piece of corpus data, after being input into the first-large-round first-small-round user intention recognition model, is equal to the label value of the 1st piece of corpus data, which means that the model correctly predicted the result of the 1st piece of corpus data. Similarly, the 2nd and 3rd pieces of corpus data can also be correctly predicted after being input into that model, while the label values of the 4th and 5th pieces of corpus data cannot be predicted; in this case, the second model accuracy corresponding to the first-large-round first-small-round user intention recognition model is 60%, that is, the second model average accuracy is equal to 60%.
After the first model average accuracy of 80% and the second model average accuracy of 60% are obtained in the above process, the difference between the two, namely 20%, can be calculated as the average accuracy difference corresponding to the 1st piece of corpus data. When calculating the average accuracy difference corresponding to the i-th piece of corpus data, reference can be made to the calculation process of the average accuracy difference of the 1st piece of corpus data. The average accuracy difference obtained for each corpus data can serve as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
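A minimal sketch of the average accuracy difference of step S104 for a single corpus item, under the assumption that each trained model is paired with its held-out test subset (as in the sketch above), that every corpus item carries an "id" and a "label" field, and that predict(model, item) returns the model's predicted intention; all of these names are illustrative, not part of the original disclosure.

```python
def model_accuracy(model, test_subset, predict):
    """Fraction of test items whose predicted intention equals their label."""
    correct = sum(1 for item in test_subset if predict(model, item) == item["label"])
    return correct / len(test_subset)

def average_accuracy_difference(item_id, models, predict):
    """models: dict j -> (model, held-out test subset), as built by train_k_models."""
    as_train, as_test = [], []
    for model, test_subset in models.values():
        acc = model_accuracy(model, test_subset, predict)
        if any(it["id"] == item_id for it in test_subset):
            as_test.append(acc)   # the item was held out, i.e. served as a test sample
        else:
            as_train.append(acc)  # the item was part of this model's training set
    first_avg = sum(as_train) / len(as_train)   # first model average accuracy
    second_avg = sum(as_test) / len(as_test)    # second model average accuracy
    return first_avg - second_avg
```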
S105. For each corpus data in the full corpus data set, obtain the first sample recall rate corresponding to the corpus data serving as training sample data of the user intention recognition models and the second sample recall rate corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two to obtain the sample recall rate difference corresponding to each corpus data.
In this embodiment, the sample recall rate difference obtained for each corpus data can serve as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
In an embodiment, step S105 includes:
determining whether the i-th piece of corpus data in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models;
if the i-th piece of corpus data serves as training sample data of the user intention recognition models, obtaining a third target user intention recognition model set corresponding to the i-th piece of corpus data serving as training sample data, calculating the sample recall rate corresponding to each third target user intention recognition model in the third target user intention recognition model set, and averaging them, so as to obtain the first sample recall rate corresponding to the i-th piece of corpus data serving as training sample data;
if the i-th piece of corpus data serves as test sample data of the user intention recognition models, obtaining a fourth target user intention recognition model set corresponding to the i-th piece of corpus data serving as test sample data, calculating the sample recall rate corresponding to each fourth target user intention recognition model in the fourth target user intention recognition model set, and averaging them, so as to obtain the second sample recall rate corresponding to the i-th piece of corpus data serving as test sample data;
calculating the difference between the first sample recall rate corresponding to the i-th piece of corpus data serving as training sample data and the second sample recall rate corresponding to the i-th piece of corpus data serving as test sample data, so as to obtain the sample recall rate difference corresponding to the i-th piece of corpus data.
In this embodiment, for example, when calculating the sample recall rate difference corresponding to the 1st piece of corpus data, the case in which the 1st piece of corpus data serves as a training data sample is calculated first. The first model recall rate corresponding to the first-large-round second-small-round user intention recognition model is 20% (if the intention of the 1st piece of corpus data itself is A, the first model recall rate is calculated as the actual number of test sample data pieces, among all test sample data corresponding to the first-large-round second-small-round user intention recognition model, whose model prediction result is A and whose prediction is correct, divided by the total number of test sample data pieces whose model prediction result is A), the second model recall rate corresponding to the first-large-round third-small-round user intention recognition model is 40% (calculated in the same way as the first model recall rate), and the third model recall rate corresponding to the first-large-round fourth-small-round user intention recognition model is 60% (calculated in the same way as the first model recall rate). The first sample recall rate is then obtained by averaging the first, second and third model recall rates, that is, the first sample recall rate is 40%. Next, for the case in which the 1st piece of corpus data serves as a test data sample, the fourth model recall rate corresponding to the first-large-round first-small-round user intention recognition model is 20%, and this fourth model recall rate can serve as the second sample recall rate, so the sample recall rate difference corresponding to the 1st piece of corpus data is 20%. When calculating the sample recall rate difference corresponding to the i-th piece of corpus data, reference can be made to the calculation process of the sample recall rate difference of the 1st piece of corpus data.
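A sketch of the per-model rate used in this recall example, following the formulation given above (test items predicted as intention A and predicted correctly, divided by all test items predicted as intention A); the data layout and helper names are the same illustrative assumptions as in the earlier sketches.

```python
def model_recall_for_label(model, test_subset, predict, label):
    """Among test items the model predicts as `label`, the fraction predicted correctly,
    as formulated in the worked example above."""
    predicted = [item for item in test_subset if predict(model, item) == label]
    if not predicted:
        return 0.0
    correct = sum(1 for item in predicted if item["label"] == label)
    return correct / len(predicted)

def sample_recall_difference(item, models, predict):
    """First sample recall rate (item used for training) minus second sample recall rate
    (item held out as a test sample)."""
    label = item["label"]
    as_train, as_test = [], []
    for model, test_subset in models.values():
        rate = model_recall_for_label(model, test_subset, predict, label)
        held_out = any(it["id"] == item["id"] for it in test_subset)
        (as_test if held_out else as_train).append(rate)
    return sum(as_train) / len(as_train) - sum(as_test) / len(as_test)
```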
S106. For each corpus data in the full corpus data set, obtain the first prediction average accuracy corresponding to the corpus data serving as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data serving as test sample data of the user intention recognition models, and calculate the difference between the two to obtain the prediction accuracy difference corresponding to each corpus data.
In this embodiment, the prediction accuracy difference obtained for each corpus data can serve as one of the evaluation indicators for judging whether the corpus data is a negative-contribution sample.
In an embodiment, step S106 includes:
determining whether the i-th piece of corpus data in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models;
if the i-th piece of corpus data serves as training sample data of the user intention recognition models, obtaining a fifth target user intention recognition model set corresponding to the i-th piece of corpus data serving as training sample data, calculating the prediction accuracy corresponding to each fifth target user intention recognition model in the fifth target user intention recognition model set, and averaging them, so as to obtain the first prediction average accuracy corresponding to the i-th piece of corpus data serving as training sample data;
if the i-th piece of corpus data serves as test sample data of the user intention recognition models, obtaining a sixth target user intention recognition model set corresponding to the i-th piece of corpus data serving as test sample data, calculating the prediction accuracy corresponding to each sixth target user intention recognition model in the sixth target user intention recognition model set, and averaging them, so as to obtain the second prediction average accuracy corresponding to the i-th piece of corpus data serving as test sample data;
calculating the difference between the first prediction average accuracy corresponding to the i-th piece of corpus data serving as training sample data and the second prediction average accuracy corresponding to the i-th piece of corpus data serving as test sample data, so as to obtain the prediction accuracy difference corresponding to the i-th piece of corpus data.
In this embodiment, for example, when calculating the prediction accuracy difference corresponding to the 1st piece of corpus data, the case in which the 1st piece of corpus data serves as a training data sample is likewise calculated first. The first prediction accuracy corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models is 100% (if the result of the 1st piece of corpus data itself is A, the first prediction accuracy is calculated as follows: the prediction result of the first-large-round first-small-round user intention recognition model for the 1st piece of corpus data is A, and the total number of test data samples of the 1st piece of corpus data corresponding to that model is 1, so the number of correct prediction results for the 1st piece of corpus data is divided by the total number of times the 1st piece of corpus data serves as a test data sample, giving a first prediction accuracy of 100%), the second prediction accuracy is 100% (calculated with reference to the calculation of the first prediction accuracy), and the third prediction accuracy is 100% (calculated with reference to the calculation of the first prediction accuracy). The first prediction average accuracy is then obtained by averaging the first, second and third prediction accuracies, that is, the first prediction average accuracy is 100%.
Next, when the 1st piece of corpus data serves as a test data sample, the second prediction average accuracy is obtained by averaging the fourth, fifth and sixth prediction accuracies respectively corresponding to the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models. The fourth prediction accuracy is calculated as follows: the prediction result for the 1st piece of corpus data is A in the first-large-round second-small-round, first-large-round third-small-round and first-large-round fourth-small-round user intention recognition models, and the total number of test data samples of the 1st piece of corpus data corresponding to these models is 3, so the total number of correct prediction results for the 1st piece of corpus data is divided by the total number of times the 1st piece of corpus data serves as a training data sample, giving a fourth prediction accuracy of 100%. The fifth and sixth prediction accuracies are calculated with reference to the calculation of the fourth prediction accuracy; for example, if the fifth prediction accuracy is 100% and the sixth prediction accuracy is 100%, the second prediction average accuracy corresponding to the 1st piece of corpus data is 100% (obtained by averaging the fourth, fifth and sixth prediction accuracies). At this point, the prediction accuracy difference corresponding to the 1st piece of corpus data is equal to the difference between the first prediction average accuracy and the second prediction average accuracy, that is, the prediction accuracy difference corresponding to the 1st piece of corpus data is equal to 0. When calculating the prediction accuracy difference corresponding to the i-th piece of corpus data, reference can be made to the calculation process of the prediction accuracy difference of the 1st piece of corpus data.
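The prediction accuracy terms track how often the models predict this particular corpus item's intention correctly. The sketch below follows one straightforward reading of step S106 (per-item correctness averaged over the models for which the item was a training sample versus the models for which it was held out as a test sample); it reuses the illustrative predict helper and data layout assumed earlier and is not the only possible reading of the worked example.

```python
def prediction_accuracy_difference(item, models, predict):
    """Difference between the item's average per-item correctness over models that
    trained on it and over the model(s) that held it out as a test sample."""
    as_train, as_test = [], []
    for model, test_subset in models.values():
        correct = 1.0 if predict(model, item) == item["label"] else 0.0
        if any(it["id"] == item["id"] for it in test_subset):
            as_test.append(correct)
        else:
            as_train.append(correct)
    return sum(as_train) / len(as_train) - sum(as_test) / len(as_test)
```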
S107: Obtain, for each corpus data item, the corresponding sample contribution triple according to the average-accuracy difference, the sample-recall difference and the prediction-accuracy difference of that item.
In this embodiment, in order to judge objectively whether each corpus data item is a negative-contribution sample, the average-accuracy difference, sample-recall difference and prediction-accuracy difference corresponding to each corpus data item are first combined, so as to obtain the sample contribution triple corresponding to each corpus data item.
In an embodiment, step S107 includes:
concatenating, in order, the average-accuracy difference, the sample-recall difference and the prediction-accuracy difference corresponding to each corpus data item, to obtain the sample contribution triple corresponding to that item.
In this embodiment, having obtained for the 1st corpus data item a model average-accuracy difference of 20%, a sample-recall difference of 20% and a prediction-accuracy difference of 0, the sample contribution triple corresponding to the 1st corpus data item is [20%, 20%, 0]. Likewise, once the first major round of verification experiments has been completed, the sample contribution triple corresponding to any i-th corpus data item in data set X is known.
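As a small illustration (the variable names are assumptions, not from the specification), the triple can simply be an ordered concatenation of the three per-item differences computed in the preceding steps:

```python
def build_contribution_triples(avg_acc_diff, recall_diff, pred_acc_diff):
    """Each argument maps corpus-item index -> the corresponding difference.
    Returns index -> (average-accuracy diff, sample-recall diff, prediction-accuracy diff)."""
    return {
        i: (avg_acc_diff[i], recall_diff[i], pred_acc_diff[i])
        for i in avg_acc_diff
    }

# Worked example for item 1: 20%, 20% and 0.
triples = build_contribution_triples({1: 0.20}, {1: 0.20}, {1: 0.0})
print(triples[1])  # (0.2, 0.2, 0.0)
```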
S108: Judge whether there is any corpus data item whose sample contribution triple has an average-accuracy difference, sample-recall difference and prediction-accuracy difference that are all negative.
In this embodiment, when the average-accuracy difference, sample-recall difference and prediction-accuracy difference in the sample contribution triple of a given corpus data item are all negative, it indicates that using this item as training data for the user intention recognition model is very unlikely to make a beneficial contribution; the item can then be removed from the full corpus data set to improve the training data quality of the updated full corpus data set.
When the three differences in the sample contribution triple of a corpus data item are not all negative, the item may still make a beneficial contribution as training data for the user intention recognition model and can remain in the full corpus data set.
S109: If there are corpus data items whose sample contribution triples have an average-accuracy difference, sample-recall difference and prediction-accuracy difference that are all negative, obtain these target corpus data items to form the corpus data set to be deleted.
In this embodiment, once all target corpus data items in the full corpus data set whose three differences (the average-accuracy difference, the sample-recall difference and the prediction-accuracy difference) are all negative have been identified, these target items form the corpus data set to be deleted; every item in this set can be removed from the full corpus data set to improve its data quality.
S110: Delete the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
In this embodiment, after the corpus data set to be deleted has been removed from the full corpus data set, the full corpus data set has changed: compared with the full corpus data set initially obtained in step S101, the total number of corpus data items in the current full corpus data set is less than or equal to the total number in the initially obtained set. The updated full corpus data set can be used locally on the server as a compact, high-quality training set for continuing to train the user intention recognition model, yielding a user intention recognition model with higher recognition accuracy.
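A minimal sketch of steps S108 to S110, assuming the corpus and the triples are kept in plain dictionaries keyed by item index (these structures are illustrative, not part of the specification):

```python
def prune_negative_samples(corpus, triples):
    """corpus: index -> corpus data item; triples: index -> (d1, d2, d3).
    Returns the updated corpus and the set of indices that were deleted."""
    to_delete = {
        i for i, (d1, d2, d3) in triples.items()
        if d1 < 0 and d2 < 0 and d3 < 0          # all three differences negative
    }
    updated = {i: item for i, item in corpus.items() if i not in to_delete}
    return updated, to_delete

corpus = {1: "query about premium payment", 2: "mislabeled noisy utterance"}
triples = {1: (0.20, 0.20, 0.0), 2: (-0.05, -0.10, -0.02)}
updated, removed = prune_negative_samples(corpus, triples)
print(removed)          # {2}
print(sorted(updated))  # [1]
```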
In an embodiment, step S110 is followed by:
obtaining the current iteration count and adding one to it to update the current iteration count, the initial value of the current iteration count being 0;
judging whether the current iteration count exceeds a preset maximum iteration count;
if the current iteration count does not exceed the preset maximum iteration count, invoking a preset total number of supplementary corpus data items, and randomly drawing from a local corpus pool supplementary corpus data with the same total number of items as that preset total, to form a supplementary corpus data set;
adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to the step of obtaining the full corpus data set;
if the current iteration count exceeds the preset maximum iteration count, ending the procedure.
In this embodiment, a round of sample data screening performed up to step S110 may reduce the amount of data. To ensure that the total amount of corpus data in the data set stays the same or increases, it is first judged whether another round of corpus supplementation can be carried out.
That is, the current iteration count (whose initial value is 0) is obtained and incremented by one to update it. Since the maximum iteration count is generally greater than 2, the supplementation step can proceed after one round of sample screening. If the current iteration count does not exceed the maximum iteration count, the preset total number of supplementary corpus data items is invoked, and supplementary corpus data with that total number of items is randomly drawn from the local corpus pool to form a supplementary corpus data set; this supplements and updates the full corpus data set of step S110, after which execution returns to step S101 for the next round of data screening. To decide whether the full corpus data set that has passed the next round of screening may enter a further round, the current iteration count is again incremented by one and compared with the preset maximum iteration count (for example, with a maximum of 10 iterations, 10 rounds of corpus supplementation can be performed): if the current iteration count does not exceed the maximum, execution returns to step S101 for another round of data screening; if it exceeds the maximum, the procedure ends. In this way the data samples in the data set are expanded automatically; a sketch of this outer loop follows. The final full corpus data set obtained in this manner can then be input into the user intention recognition model to be trained, yielding the final user intention recognition model.
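The outer loop just described could look roughly as follows; `run_one_round`, the corpus pool and the parameter values are assumptions for illustration only.

```python
import random

def iterate_cleaning(full_corpus, corpus_pool, run_one_round,
                     max_iterations=10, supplement_count=100):
    """full_corpus / corpus_pool: lists of corpus data items.
    run_one_round: callable performing steps S101-S110 on a corpus and
    returning the pruned corpus (its implementation is assumed here)."""
    iteration = 0                                    # initial value of the iteration count
    while True:
        full_corpus = run_one_round(full_corpus)     # one round of screening (S101-S110)
        iteration += 1                               # update the current iteration count
        if iteration > max_iterations:
            break                                    # maximum exceeded: end the procedure
        draw = min(supplement_count, len(corpus_pool))
        supplement = random.sample(corpus_pool, draw)  # randomly drawn supplementary corpus data
        full_corpus = full_corpus + supplement         # updated full corpus data set
    return full_corpus   # final set, used to train the final user intention recognition model
```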
This method achieves automatic cleaning of negative-contribution corpus data without human intervention, which improves the efficiency of obtaining a high-quality training set.
本申请实施例还提供一种语料数据的数据特征增强装置,该语料数据的数据特征增强装置用于执行前述语料数据的数据特征增强方法的任一实施例。具体地,请参阅图3,图3是本申请实施例提供的语料数据的数据特征增强装置的示意性框图。该语料数据的数据特征增强装置100可以配置于服务器中。The embodiment of the present application also provides a data feature enhancement device for corpus data, and the data feature enhancement device for corpus data is used to execute any embodiment of the aforementioned data feature enhancement method for corpus data. Specifically, please refer to FIG. 3, which is a schematic block diagram of a data feature enhancement device for corpus data provided in an embodiment of the present application. The data feature enhancement device 100 of the corpus data can be configured in a server.
如图3所示,语料数据的数据特征增强装置100包括:语料数据集获取单元101、数据集划分单元102、分组训练单元103、平均正确率差值计算单元104、样本召回率差值计算单元105、预测正确率差值计算单元106、样本贡献度三元组获取单元107、三元组判断单元108、负样本删除单元109、数据集第一更新单元110。As shown in FIG. 3, the data feature enhancement device 100 for corpus data includes: a corpus data set acquisition unit 101, a data set division unit 102, a group training unit 103, an average correct rate difference calculation unit 104, and a sample recall rate difference calculation unit 105. The prediction accuracy rate difference calculation unit 106, the sample contribution degree triplet acquisition unit 107, the triplet judgment unit 108, the negative sample deletion unit 109, and the data set first update unit 110.
语料数据集获取单元101,用于接收用户端发送的全量语料数据集;其中,所述全量语料数据集中包括多个语料数据。The corpus data set acquisition unit 101 is configured to receive a full corpus data set sent by a user terminal; wherein, the full corpus data set includes a plurality of corpus data.
数据集划分单元102,用于调用预先设置的分组总数值,以根据所述分组总数值将所述全量语料数据集划分为对应组数的语料数据子集。The data set dividing unit 102 is configured to call a preset total number of groups to divide the full corpus data set into corresponding groups of corpus data subsets according to the total number of groups.
The group training unit 103 is configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remainder into the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups; in each round, after one of the corpus data subsets is deleted from the full corpus data set, the deleted subset serves as the corpus test set and each corpus data item in the deleted subset serves as test sample data.
在一实施例中,分组训练单元103包括:In an embodiment, the group training unit 103 includes:
a data set labeling unit, configured to denote the full corpus data set as data set X and denote the corpus data subsets into which data set X is divided as the No. 1 corpus data subset through the No. k corpus data subset, a corpus data subset between the No. 1 corpus data subset and the No. k corpus data subset being denoted as the No. j corpus data subset, where the value of k equals the total number of groups and the value of j is a positive integer in the interval [1, k];
a first-minor-round first deletion unit, configured to delete the No. 1 corpus data subset from the full corpus data set and train the user intention recognition model to be trained on the remaining corpus data subsets as the training set, obtaining the first-major-round first-minor-round user intention recognition model;
a first-minor-round sequential deletion unit, configured to sequentially delete the No. 2 corpus data subset through the No. k corpus data subset from the full corpus data set, each time training on the remainder as the training set of the user intention recognition model to be trained, so as to obtain in turn the first-major-round second-minor-round user intention recognition model through the first-major-round k-th-minor-round user intention recognition model.
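For illustration only, the division into k subsets and the leave-one-subset-out training performed by the group training unit might be sketched as below; `train_model` stands in for whatever training routine the user intention recognition model actually uses, and is an assumption.

```python
def split_into_subsets(corpus, k):
    """Split the corpus (a list) into k subsets, item by item in order."""
    return [corpus[i::k] for i in range(k)]

def leave_one_subset_out(corpus, k, train_model):
    """Train k models; in round j the No. j subset is left out as the test set."""
    subsets = split_into_subsets(corpus, k)
    models = []
    for j in range(k):
        test_set = subsets[j]                        # the deleted subset is the corpus test set
        train_set = [item for m, subset in enumerate(subsets) if m != j
                     for item in subset]             # remaining subsets form the training set
        models.append((train_model(train_set), test_set))
    return models                                    # as many models as the total number of groups
```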
The average accuracy difference calculation unit 104 is configured to obtain, for each corpus data item in the full corpus data set, the first model average accuracy corresponding to the item when it serves as training sample data of the user intention recognition models and the second model average accuracy corresponding to the item when it serves as test sample data of the user intention recognition models, and to take their difference to obtain the average-accuracy difference corresponding to each corpus data item.
在一实施例中,平均正确率差值计算单元104包括:In an embodiment, the average correct rate difference calculation unit 104 includes:
a first judging unit, configured to judge whether the i-th corpus data item in the full corpus data set serves as training sample data or as test sample data of the user intention recognition models, where i takes a positive integer value in [1, N] and N equals the total number of corpus data items in the full corpus data set;
a first calculation unit, configured to, if the i-th corpus data item serves as training sample data of the user intention recognition models, obtain the first set of target user intention recognition models corresponding to that case, calculate the model accuracy of each first target user intention recognition model in the set and average them, obtaining the first model average accuracy corresponding to the i-th corpus data item when it serves as training sample data;
a second calculation unit, configured to, if the i-th corpus data item serves as test sample data of the user intention recognition models, obtain the second set of target user intention recognition models corresponding to that case, calculate the model accuracy of each second target user intention recognition model in the set and average them, obtaining the corresponding second model average accuracy for the i-th corpus data item;
a first difference calculation unit, configured to subtract the second model average accuracy (the i-th item as test sample data) from the first model average accuracy (the i-th item as training sample data), obtaining the average-accuracy difference corresponding to the i-th corpus data item.
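A brief sketch of what this unit computes per corpus item, assuming each trained model carries its overall accuracy together with the index sets of the items it was trained and tested on (names and structures are illustrative, not from the specification):

```python
from statistics import mean

def average_accuracy_difference(i, model_records):
    """model_records: list of dicts such as
    {"accuracy": 0.9, "train_indices": {2, 3, 4}, "test_indices": {1}}."""
    as_train = [r["accuracy"] for r in model_records if i in r["train_indices"]]
    as_test = [r["accuracy"] for r in model_records if i in r["test_indices"]]
    first_avg = mean(as_train) if as_train else 0.0    # first model average accuracy
    second_avg = mean(as_test) if as_test else 0.0     # second model average accuracy
    return first_avg - second_avg                      # average-accuracy difference for item i
```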
The sample recall difference calculation unit 105 is configured to obtain, for each corpus data item in the full corpus data set, the first sample recall corresponding to the item when it serves as training sample data of the user intention recognition models and the second sample recall corresponding to the item when it serves as test sample data, and to take their difference to obtain the sample-recall difference corresponding to each corpus data item.
在一实施例中,样本召回率差值计算单元105包括:In an embodiment, the sample recall rate difference calculation unit 105 includes:
第二判断单元,用于判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;The second judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
第三计算单元,用于若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第三目标用户意图识别模型集合,计算第三目标用户意图识别模型集合中各第三目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一样本召回率;The third calculation unit is used to obtain the third target user intent recognition model set corresponding to the i-th corpus data as the training sample data of each user intent recognition model, and calculate the third The sample recall rate corresponding to each third target user intent recognition model in the target user intent recognition model set is averaged to obtain the corresponding first sample recall rate when the i-th corpus data is used as the training sample data;
第四计算单元,用于若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第四目标用户意图识别模型集合,计算第四目标用户意图识别模型集合中各第四目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二样本召回率;The fourth calculation unit is used to calculate the fourth target user intent recognition model set corresponding to the i-th corpus data as the test sample data of each user's intent recognition model when the i-th corpus data is used as the test sample data. The sample recall rate corresponding to each fourth target user intent recognition model in the target user intent recognition model set is averaged to obtain the second sample recall rate corresponding to the i-th corpus data as the training sample data;
第二差值计算单元,用于将第i条语料数据作为训练样本数据时对应的第一样本召回率与第i条语料数据作为测试样本数据时对应的第二样本召回率求差,得到第i条语料数据对应的样本召回率差值。The second difference calculation unit is used to calculate the difference between the first sample recall rate when the i-th corpus data is used as the training sample data and the second sample recall rate when the i-th corpus data is used as the test sample data to obtain The difference of the sample recall rate corresponding to the i-th corpus data.
The prediction accuracy difference calculation unit 106 is configured to obtain, for each corpus data item in the full corpus data set, the first average prediction accuracy corresponding to the item when it serves as training sample data of the user intention recognition models and the second average prediction accuracy corresponding to the item when it serves as test sample data, and to take their difference to obtain the prediction-accuracy difference corresponding to each corpus data item.
在一实施例中,预测正确率差值计算单元106包括:In an embodiment, the prediction accuracy difference calculation unit 106 includes:
第三判断单元,用于判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;The third judging unit is used to judge whether the i-th piece of corpus data in the full corpus data set is used as training sample data for each user intent recognition model or as test sample data for each user intent recognition model;
第五计算单元,用于若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第五目标用户意图识别模型集合,计算第五目标用户意图识别模型集合中各第五目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一预测平均正确率;The fifth calculation unit is used to obtain the fifth target user intent recognition model set corresponding to the i-th corpus data as the training sample data of each user intent recognition model, and calculate the fifth The prediction accuracy rate corresponding to each fifth target user intent recognition model in the target user intention recognition model set is averaged to obtain the first prediction average accuracy rate corresponding to the i-th corpus data as the training sample data;
第六计算单元,用于若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第六目标用户意图识别模型集合,计算第六目标用户意图识别模型集合中各第六目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二预测平均正确率;The sixth calculation unit is used to calculate the sixth target user intent recognition model set corresponding to the i-th corpus data as the test sample data of each user's intention recognition model if the ith corpus data is used as the test sample data. The prediction accuracy rate corresponding to each sixth target user intent recognition model in the target user intention recognition model set is averaged to obtain the second prediction average accuracy rate corresponding to the i-th corpus data as the training sample data;
第三差值计算单元,用于将第i条语料数据作为训练样本数据时对应的第一预测平均正确率与第i条语料数据作为测试样本数据时对应的第二预测平均正确率求差,得到第i条语料数据对应的预测正确率差值。The third difference calculation unit is used to calculate the difference between the first predicted average correct rate when the i-th corpus data is used as the training sample data and the second predicted average correct rate when the i-th corpus data is used as the test sample data, Obtain the prediction accuracy difference corresponding to the i-th corpus data.
样本贡献度三元组获取单元107,用于根据每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值,获取每一语料数据分别对应的样本贡献度三元组。The sample contribution triple acquisition unit 107 is used to obtain the sample contribution triple corresponding to each corpus data according to the average accuracy difference, sample recall difference and prediction accuracy difference corresponding to each corpus data group.
在一实施例中,样本贡献度三元组获取单元107还用于:In an embodiment, the sample contribution triple acquisition unit 107 is further configured to:
将每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值依序串接,得到每一语料数据对应的样本贡献度三元组。The difference in average correctness rate, sample recall rate, and prediction accuracy rate difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triples corresponding to each corpus data.
The triple judging unit 108 is configured to judge whether there is any corpus data item whose sample contribution triple has an average-accuracy difference, sample-recall difference and prediction-accuracy difference that are all negative.
The negative sample deletion unit 109 is configured to, if there are corpus data items whose sample contribution triples have all three differences negative, obtain the corresponding target corpus data items to form the corpus data set to be deleted.
数据集第一更新单元110,用于将所述待删除语料数据集从所述全量语料数据集中删除,以更新全量语料数据集。The first data set update unit 110 is configured to delete the to-be-deleted corpus data set from the full corpus data set to update the full corpus data set.
在一实施例中,语料数据的数据特征增强装置100还包括:In an embodiment, the data feature enhancement device 100 of corpus data further includes:
当前迭代次数更新单元,用于获取当前迭代次数,将所述当前迭代次数加一,以更新当前迭代次数;其中,当前迭代次数的初始值为0;The current iteration number update unit is used to obtain the current iteration number, and add one to the current iteration number to update the current iteration number; wherein, the initial value of the current iteration number is 0;
当前迭代次数判断单元,用于判断所述当前迭代次数是否超出预先设置的最大迭代次数;The current iteration number judging unit is used to determine whether the current iteration number exceeds a preset maximum iteration number;
The automatic corpus acquisition unit is configured to, if the current iteration count does not exceed the preset maximum iteration count, invoke the preset total number of supplementary corpus data items and randomly draw, from the local corpus pool, supplementary corpus data with the same total number of items, to form a supplementary corpus data set.
语料自动补充单元,用于将所述补充语料数据集增加至所述全量语料数据集中,以更新全量语料数据集,返回执行所述获取全量语料数据集的步骤;The automatic corpus supplement unit is used to add the supplementary corpus data set to the full corpus data set to update the full corpus data set, and return to execute the step of obtaining the full corpus data set;
流程结束单元,用于若所述当前迭代次数超出预先设置的最大迭代次数,结束流程。The process ending unit is used to end the process if the current iteration number exceeds the preset maximum iteration number.
该装置实现了对负贡献语料数据的自动清洗,清洗过程无需人为干预,提升了高质量训练集的获取效率。The device realizes automatic cleaning of negative contribution corpus data, and the cleaning process does not require human intervention, which improves the efficiency of obtaining high-quality training sets.
上述语料数据的数据特征增强装置可以实现为计算机程序的形式,该计算机程序可以在如图4所示的计算机设备上运行。The above-mentioned data feature enhancement device for corpus data can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 4.
请参阅图4,图4是本申请实施例提供的计算机设备的示意性框图。该计算机设备500是服务器,服务器可以是独立的服务器,也可以是多个服务器组成的服务器集群。Please refer to FIG. 4, which is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
参阅图4,该计算机设备500包括通过系统总线501连接的处理器502、存储器和网络接口505,其中,存储器可以包括非易失性存储介质503和内存储器504。Referring to FIG. 4, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
该非易失性存储介质503可存储操作系统5031和计算机程序5032。该计算机程序5032被执行时,可使得处理器502执行语料数据的数据特征增强方法。The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, the processor 502 can execute the data feature enhancement method of the corpus data.
该处理器502用于提供计算和控制能力,支撑整个计算机设备500的运行。The processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.
该内存储器504为非易失性存储介质503中的计算机程序5032的运行提供环境,该计算机程序5032被处理器502执行时,可使得处理器502执行语料数据的数据特征增强方法。The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute the data feature enhancement method of corpus data.
该网络接口505用于进行网络通信,如提供数据信息的传输等。本领域技术人员可以理解,图4中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备500的限定,具体的计算机设备500可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。The network interface 505 is used for network communication, such as providing data information transmission. Those skilled in the art can understand that the structure shown in FIG. 4 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied. The specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
其中,所述处理器502用于运行存储在存储器中的计算机程序5032,以实现本申请实施例公开的语料数据的数据特征增强方法。Wherein, the processor 502 is configured to run a computer program 5032 stored in a memory to implement the data feature enhancement method of corpus data disclosed in the embodiment of the present application.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 4 does not limit the specific configuration of the computer device; in other embodiments the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 4 and are not repeated here.
应当理解,在本申请实施例中,处理器502可以是中央处理单元(Central Processing Unit,CPU),该处理器502还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规 的处理器等。It should be understood that in this embodiment of the application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. Among them, the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
在本申请的另一实施例中提供计算机可读存储介质。所述计算机可读存储介质可以是非易失性,也可以是易失性。该计算机可读存储介质存储有计算机程序,其中计算机程序被处理器执行时实现本申请实施例公开的语料数据的数据特征增强方法。In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the data feature enhancement method of corpus data disclosed in the embodiments of the present application.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的设备、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the above-described equipment, device, and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here. A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of both, in order to clearly illustrate the hardware and software Interchangeability, in the above description, the composition and steps of each example have been generally described in accordance with the function. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
在本申请所提供的几个实施例中,应该理解到,所揭露的设备、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为逻辑功能划分,实际实现时可以有另外的划分方式,也可以将具有相同功能的单元集合成一个单元,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。In the several embodiments provided in this application, it should be understood that the disclosed equipment, device, and method may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods, or the units with the same function may be combined into one. Units, for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of this application is essentially or the part that contributes to the existing technology, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium. It includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), magnetic disk or optical disk and other media that can store program codes.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. 一种语料数据的数据特征增强方法,其中,包括:A method for enhancing data features of corpus data, which includes:
    获取全量语料数据集;其中,所述全量语料数据集中包括多个语料数据;Acquiring a full corpus data set; wherein the full corpus data set includes multiple corpus data;
    调用预先设置的分组总数值,以根据所述分组总数值将所述全量语料数据集划分为对应组数的语料数据子集;Calling a preset group total value to divide the full corpus data set into corresponding groups of corpus data subsets according to the group total value;
    依序删除所述全量语料数据集对应划分的其中一个语料数据子集后分别输入至待训练用户意图识别模型,以得到和分组总数值有相同个数的用户意图识别模型;其中,每一轮删除所述全量语料数据集对应划分的其中一个语料数据子集后,该被删除的语料数据子集作为语料测试集,该被删除的语料数据子集中每一语料数据作为测试样本数据;One of the corpus data subsets corresponding to the full corpus data set is sequentially deleted and input to the user intent recognition model to be trained to obtain the same number of user intent recognition models as the total number of groups; wherein, each round After deleting one of the corpus data subsets corresponding to the division of the full corpus data set, the deleted corpus data subset is used as the corpus test set, and each corpus data in the deleted corpus data subset is used as the test sample data;
    获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一模型平均正确率和第二模型平均正确率求差值,以得到每一语料数据对应的平均正确率差值;Obtain the average correct rate of the first model and the average correct rate of the second model corresponding to each corpus data in the full corpus data set as the training sample data of each user intent recognition model and the test sample data of each user intent recognition model. Difference to get the average correctness difference corresponding to each corpus data;
    获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一样本召回率和第二样本召回率求差值,以得到每一语料数据对应的样本召回率差值;Obtain the first sample recall rate and the second sample recall rate corresponding to each corpus data in the full corpus data set as the training sample data of each user intent recognition model and the test sample data as the test sample data of each user intent recognition model. Value to get the sample recall rate difference corresponding to each corpus data;
    获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一预测平均正确率和第二预测平均正确率求差值,以得到每一语料数据对应的预测正确率差值;Obtain the first prediction average accuracy rate and the second prediction average accuracy rate corresponding to each corpus data in the full corpus data set as the training sample data of each user intent recognition model and the test sample data as the test sample data of each user intent recognition model. Difference to get the difference of prediction accuracy rate corresponding to each corpus data;
    根据每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值,获取每一语料数据分别对应的样本贡献度三元组;According to the difference in average accuracy rate, sample recall rate and prediction accuracy rate difference corresponding to each corpus data, obtain the sample contribution triples corresponding to each corpus data;
    判断是否存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值均为负值;Determine whether there is a sample contribution triple corresponding to the corpus data. The difference in average accuracy, sample recall, and prediction accuracy are all negative;
    若存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值均为负值,获取对应的目标语料数据,以组成待删除语料数据集;以及If there is a sample contribution triple corresponding to the corpus data, the average accuracy difference, the sample recall rate and the prediction accuracy difference are all negative, and the corresponding target corpus data is obtained to form the corpus data set to be deleted ;as well as
    将所述待删除语料数据集从所述全量语料数据集中删除,以更新全量语料数据集。The corpus data set to be deleted is deleted from the full corpus data set to update the full corpus data set.
  2. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述将所述待删除语料数据集从所述全量语料数据集中删除,以更新全量语料数据集之后,还包括:The method for enhancing data features of corpus data according to claim 1, wherein after said deleting the to-be-deleted corpus data set from the full corpus data set to update the full corpus data set, the method further comprises:
    获取当前迭代次数,将所述当前迭代次数加一,以更新当前迭代次数;其中,当前迭代次数的初始值为0;Obtain the current iteration number, and add one to the current iteration number to update the current iteration number; wherein, the initial value of the current iteration number is 0;
    判断所述当前迭代次数是否超出预先设置的最大迭代次数;Judging whether the current number of iterations exceeds a preset maximum number of iterations;
    若所述当前迭代次数未超出预先设置的最大迭代次数,调用预先设置的补充语料数据总条数,从本地语料池中随机抽取与所述补充语料数据总条数有相同总数据条数的补充语料数据,以组成补充语料数据集;If the current number of iterations does not exceed the preset maximum number of iterations, call the preset total number of supplementary corpus data, and randomly select supplements that have the same total number of data as the total number of supplementary corpus data from the local corpus Corpus data to form a supplementary corpus data set;
    将所述补充语料数据集增加至所述全量语料数据集中,以更新全量语料数据集,返回执行所述获取全量语料数据集的步骤;Adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to execute the step of obtaining the full corpus data set;
    若所述当前迭代次数超出预先设置的最大迭代次数,结束流程。If the current number of iterations exceeds the preset maximum number of iterations, the process ends.
  3. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述依序删除所述全量语料数据集对应划分的其中一个语料数据子集后分别输入至待训练用户意图识别模型,以得到和分组总数值有相同个数的用户意图识别模型,包括:The method for enhancing data features of corpus data according to claim 1, wherein said sequentially deleting one of the corpus data subsets corresponding to the division of said full corpus data set is input to the user intent recognition model to be trained to obtain User intention recognition models that have the same number as the total number of groups, including:
    denoting the full corpus data set as data set X, and denoting the corpus data subsets into which data set X is divided as the No. 1 corpus data subset through the No. k corpus data subset, a corpus data subset between the No. 1 corpus data subset and the No. k corpus data subset being denoted as the No. j corpus data subset, where the value of k equals the total number of groups and the value of j is a positive integer in the interval [1, k];
    将第1号语料数据子集从所述全量语料数据集中删除,将所述全量语料数据集中余下的其他语料数据子集作为所述待训练用户意图识别模型的训练集进行训练,得到第一大轮第一小轮用户意图识别模型;Delete the No. 1 corpus data subset from the full corpus data set, and use the remaining corpus data subsets in the full corpus data set as the training set of the user intention recognition model to be trained for training, and get the first largest The first round of user intention recognition model;
    依序将第2号语料数据子集至第k号语料数据子集分别从全量语料数据集中删除后以作为所述待训练用户意图识别模型的训练集进行训练,依序得到第一大轮第二小轮用户意图识别模型至第一大轮第k小轮用户意图识别模型。The second corpus data subset to the kth corpus data subset are deleted from the full corpus data set in sequence, and then used as the training set of the user intention recognition model to be trained for training, and the first round of the first round is obtained in sequence. The second round of user intention recognition model to the first big round of k-th small round user intention recognition model.
  4. 根据权利要求3所述的语料数据的数据特征增强方法,其中,所述获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一模型平均正确率和第二模型平均正确率求差值,以得到每一语料数据对应的平均正确率差值,包括:The method for enhancing data features of corpus data according to claim 3, wherein each corpus data in the full corpus data set is used as training sample data for each user intent recognition model and as a test for each user intent recognition model The average correct rate of the first model and the average correct rate of the second model corresponding to the sample data are calculated to obtain the difference of the average correct rate corresponding to each corpus data, including:
    judging whether the i-th corpus data item in the full corpus data set serves as training sample data of the user intention recognition models or as test sample data of the user intention recognition models, where i takes a positive integer value in [1, N] and N equals the total number of corpus data items in the full corpus data set;
    若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第一目标用户意图识别模型集合,计算第一目标用户意图识别模型集合中各第一目标用户意图识别模型对应的模型正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一模型平均正确率;If the i-th corpus data is used as the training sample data of each user's intent recognition model, the first target user intent recognition model set corresponding to the i-th corpus data as the training sample data is obtained, and the first target user intent recognition model set is calculated The model correctness rate corresponding to each first target user intent recognition model is to be averaged, and the average correct rate of the corresponding first model when the i-th corpus data is used as the training sample data is obtained;
    若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第二目标用户意图识别模型集合,计算第二目标用户意图识别模型集合中各第二目标用户意图识别模型对应的模型正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二模型平均正确率;If the i-th corpus data is used as the test sample data of each user's intent recognition model, the second target user intent recognition model set corresponding to the i-th corpus data as the test sample data is obtained, and the second target user intent recognition model set is calculated The model correct rate corresponding to each second target user intent recognition model is averaged to obtain the average correct rate of the corresponding second model when the i-th corpus data is used as the training sample data;
    subtracting the second model average accuracy corresponding to the i-th corpus data item when it serves as test sample data from the first model average accuracy corresponding to the i-th corpus data item when it serves as training sample data, to obtain the average-accuracy difference corresponding to the i-th corpus data item.
  5. 根据权利要求4所述的语料数据的数据特征增强方法,其中,所述获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一样本召回率和第二样本召回率求差值,以得到每一语料数据对应的样本召回率差值,包括:The method for enhancing data features of corpus data according to claim 4, wherein each corpus data in the full corpus data set is used as training sample data for each user intent recognition model and as a test for each user intent recognition model The difference between the first sample recall rate and the second sample recall rate corresponding to the sample data is calculated to obtain the sample recall rate difference corresponding to each corpus data, including:
    判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;Determine whether the i-th corpus data in the full corpus data set is used as training sample data for each user's intention recognition model, or as test sample data for each user's intention recognition model;
    若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第三目标用户意图识别模型集合,计算第三目标用户意图识别模型集合中各第三目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第一样本召回率;If the i-th corpus data is used as the training sample data for each user's intent recognition model, the third target user intent recognition model set corresponding to the i-th corpus data as the training sample data is obtained, and the third target user intent recognition model set is calculated The sample recall rate corresponding to each third target user intention recognition model is averaged, and the corresponding first sample recall rate when the i-th corpus data is used as the training sample data is obtained;
    若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第四目标用户意图识别模型集合,计算第四目标用户意图识别模型集合中各第四目标用户意图识别模型对应的样本召回率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二样本召回率;If the i-th corpus data is used as the test sample data of each user's intent recognition model, the fourth target user intent recognition model set corresponding to the i-th corpus data as the test sample data is obtained, and the fourth target user intent recognition model set is calculated The sample recall rate corresponding to each fourth target user intention recognition model is averaged to obtain the second sample recall rate corresponding to the i-th corpus data as the training sample data;
    将第i条语料数据作为训练样本数据时对应的第一样本召回率与第i条语料数据作为测试样本数据时对应的第二样本召回率求差,得到第i条语料数据对应的样本召回率差值。When the i-th corpus data is used as the training sample data, the corresponding first sample recall rate and the i-th corpus data corresponding to the second sample recall rate when the i-th corpus data is used as the test sample data is the difference, and the sample recall corresponding to the i-th corpus data is obtained Rate difference.
  6. 根据权利要求5所述的语料数据的数据特征增强方法,其中,所述获取所述全量语料数据集中每一语料数据作为各用户意图识别模型的训练样本数据、和作为各用户意图识别模型的测试样本数据分别对应的第一预测平均正确率和第二预测平均正确率求差值,以得到每一语料数据对应的预测正确率差值,包括:The data feature enhancement method of corpus data according to claim 5, wherein each corpus data in the full corpus data set is used as training sample data for each user intent recognition model and as a test for each user intent recognition model The difference between the first prediction average accuracy rate and the second prediction average accuracy rate corresponding to the sample data respectively to obtain the prediction accuracy difference corresponding to each corpus data includes:
    判断所述全量语料数据集中第i条语料数据是作为各用户意图识别模型的训练样本数据,或是作为各用户意图识别模型的测试样本数据;Determine whether the i-th corpus data in the full corpus data set is used as training sample data for each user's intention recognition model, or as test sample data for each user's intention recognition model;
    若第i条语料数据是作为各用户意图识别模型的训练样本数据,获取第i条语料数据作为训练样本数据时对应的第五目标用户意图识别模型集合,计算第五目标用户意图识别模型集合中各第五目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为 训练样本数据时对应的第一预测平均正确率;If the i-th corpus data is used as the training sample data for each user's intention recognition model, the fifth target user intent recognition model set corresponding to the i-th corpus data as the training sample data is obtained, and the fifth target user intent recognition model set is calculated The prediction accuracy rate corresponding to each fifth target user intent recognition model is averaged, and the first prediction average accuracy rate corresponding to the i-th corpus data as the training sample data is obtained;
    若第i条语料数据是作为各用户意图识别模型的测试样本数据,获取第i条语料数据作为测试样本数据时对应的第六目标用户意图识别模型集合,计算第六目标用户意图识别模型集合中各第六目标用户意图识别模型对应的预测正确率以求平均值,得到第i条语料数据作为训练样本数据时对应的第二预测平均正确率;If the i-th corpus data is used as the test sample data of each user's intent recognition model, the sixth target user intent recognition model set corresponding to the i-th corpus data as the test sample data is obtained, and the sixth target user intent recognition model set is calculated The prediction accuracy rate corresponding to each sixth target user intent recognition model is averaged, and the second prediction average accuracy rate corresponding to the i-th corpus data as the training sample data is obtained;
    将第i条语料数据作为训练样本数据时对应的第一预测平均正确率与第i条语料数据作为测试样本数据时对应的第二预测平均正确率求差,得到第i条语料数据对应的预测正确率差值。When the i-th corpus data is used as the training sample data, the corresponding first prediction average correct rate and the i-th corpus data corresponding to the second prediction average correct rate when the i-th corpus data is used as the test sample data are calculated to obtain the prediction corresponding to the i-th corpus data The difference in accuracy.
  7. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述根据每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值,获取每一语料数据分别对应的样本贡献度三元组,包括:The method for enhancing data features of corpus data according to claim 1, wherein the difference in average correctness rate, sample recall rate, and prediction accuracy rate difference corresponding to each corpus data is used to obtain each corpus data. The corresponding sample contribution triples include:
    将每一语料数据对应的平均正确率差值、样本召回率差值和预测正确率差值依序串接,得到每一语料数据对应的样本贡献度三元组。The difference in average correctness rate, sample recall rate, and prediction accuracy rate difference corresponding to each corpus data are concatenated in sequence to obtain the sample contribution triples corresponding to each corpus data.
  8. 根据权利要求1所述的语料数据的数据特征增强方法,其中,所述判断是否存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值均为负值之后,还包括:The method for enhancing data features of corpus data according to claim 1, wherein said determining whether there is a difference in average correctness rate, sample recall rate difference, and prediction correctness rate difference in the sample contribution triples corresponding to the corpus data After the values are all negative, it also includes:
    若存在有语料数据对应的样本贡献度三元组中平均正确率差值、样本召回率差值和预测正确率差值不均为负值,将该语料数据保留在全量语料数据集中。If there are corpus data corresponding to the sample contribution triples in the average accuracy rate difference, the sample recall rate difference and the prediction accuracy rate difference are not all negative values, the corpus data is retained in the full corpus data set.
  9. 一种语料数据的数据特征增强装置,其中,包括:A data feature enhancement device for corpus data, which includes:
    a corpus data set acquisition unit, configured to acquire a full corpus data set, wherein the full corpus data set comprises a plurality of pieces of corpus data;
    a data set division unit, configured to call a preset total number of groups, so as to divide the full corpus data set into corpus data subsets of a corresponding number of groups according to the total number of groups;
    a group training unit, configured to sequentially delete one of the corpus data subsets into which the full corpus data set is divided and input the remaining subsets to a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, wherein, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each piece of corpus data in the deleted corpus data subset serves as test sample data;
    an average accuracy difference calculation unit, configured to obtain, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and to calculate their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data;
    a sample recall difference calculation unit, configured to obtain, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and to calculate their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data;
    a prediction accuracy difference calculation unit, configured to obtain, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and to calculate their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data;
    a sample contribution triple acquisition unit, configured to obtain, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data;
    a triple judgment unit, configured to judge whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative;
    a negative sample deletion unit, configured to, if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, obtain the corresponding target corpus data to form a corpus data set to be deleted; and
    a data set first update unit, configured to delete the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
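As a purely illustrative aid, the behavior of the last few units above (triple acquisition, triple judgment, negative-sample deletion and data set update) can be sketched in Python as follows; the class and function names, the toy utterances and the hard-coded difference values are assumptions made for the sketch and are not part of the claimed apparatus.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class ContributionTriple:
        acc_diff: float        # average accuracy difference
        recall_diff: float     # sample recall difference
        pred_acc_diff: float   # prediction accuracy difference

        def all_negative(self) -> bool:
            # condition checked by the triple judgment unit
            return self.acc_diff < 0 and self.recall_diff < 0 and self.pred_acc_diff < 0

    def update_full_set(corpus: List[str], triples: Dict[str, ContributionTriple]) -> List[str]:
        # negative-sample deletion + data set update: drop samples whose triple is all negative
        return [sample for sample in corpus if not triples[sample].all_negative()]

    if __name__ == "__main__":
        corpus = ["turn on the light", "asdfgh", "book a flight"]
        triples = {
            "turn on the light": ContributionTriple(0.02, 0.01, 0.00),
            "asdfgh": ContributionTriple(-0.03, -0.02, -0.01),   # all negative -> deleted
            "book a flight": ContributionTriple(0.01, -0.01, 0.02),
        }
        print(update_full_set(corpus, triples))   # ['turn on the light', 'book a flight']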
  10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    acquiring a full corpus data set, wherein the full corpus data set comprises a plurality of pieces of corpus data;
    calling a preset total number of groups, so as to divide the full corpus data set into corpus data subsets of a corresponding number of groups according to the total number of groups;
    sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, wherein, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each piece of corpus data in the deleted corpus data subset serves as test sample data;
    obtaining, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data;
    obtaining, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data;
    judging whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative;
    if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
    deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
  11. The computer device according to claim 10, wherein, after the deleting the corpus data set to be deleted from the full corpus data set so as to update the full corpus data set, the steps further comprise:
    acquiring a current iteration count and adding one to the current iteration count to update the current iteration count, wherein the initial value of the current iteration count is 0;
    judging whether the current iteration count exceeds a preset maximum iteration count;
    if the current iteration count does not exceed the preset maximum iteration count, calling a preset total number of supplementary corpus data entries, and randomly extracting, from a local corpus pool, supplementary corpus data whose total number of entries is the same as the total number of supplementary corpus data entries, so as to form a supplementary corpus data set;
    adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to the step of acquiring the full corpus data set;
    if the current iteration count exceeds the preset maximum iteration count, ending the process.
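The outer loop this claim describes can be sketched, for illustration only, as the Python fragment below; clean_one_round() is a hypothetical stand-in for the contribution-based filtering of claim 10, and the toy filtering criterion, the pool contents and all names are assumptions of the sketch rather than part of the claimed device.

    import random
    from typing import List

    def clean_one_round(full_set: List[str]) -> List[str]:
        """Placeholder for one round of the contribution-based filtering of claim 10."""
        return [s for s in full_set if "noise" not in s]     # toy criterion, for the sketch only

    def iterate_enhancement(full_set: List[str], corpus_pool: List[str],
                            supplement_count: int, max_iterations: int) -> List[str]:
        iteration = 0                                        # initial value of the iteration count
        while True:
            full_set = clean_one_round(full_set)             # delete the to-be-deleted corpus data
            iteration += 1                                   # update the current iteration count
            if iteration > max_iterations:                   # exceeds the preset maximum -> end
                return full_set
            supplement = random.sample(corpus_pool, supplement_count)
            full_set = full_set + supplement                 # add the supplementary corpus data set

    if __name__ == "__main__":
        random.seed(0)
        pool = ["pool utterance %d" % n for n in range(20)]
        start = ["good sample a", "noise sample b", "good sample c"]
        print(iterate_enhancement(start, pool, supplement_count=2, max_iterations=3))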
  12. The computer device according to claim 10, wherein the sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, comprises:
    denoting the full corpus data set as a data set X, and denoting the corpus data subsets into which the data set X is divided as corpus data subset No. 1 to corpus data subset No. k, a corpus data subset between corpus data subset No. 1 and corpus data subset No. k being denoted as corpus data subset No. j, wherein the value of k equals the total number of groups, and j takes positive integer values in the interval [1, k];
    deleting corpus data subset No. 1 from the full corpus data set, and performing training with the remaining corpus data subsets in the full corpus data set as the training set of the user intention recognition model to be trained, so as to obtain the user intention recognition model of the first major round, first minor round;
    sequentially deleting corpus data subset No. 2 to corpus data subset No. k from the full corpus data set and, in each case, performing training with the remaining corpus data subsets as the training set of the user intention recognition model to be trained, so as to sequentially obtain the user intention recognition models of the first major round, second minor round to the first major round, k-th minor round.
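A rough Python illustration of the grouping and the "delete one subset, train on the rest" procedure described in this claim is given below; the interleaved split, the most-frequent-label stand-in for the user intention recognition model and the toy utterances are assumptions of the sketch, not part of the claimed method.

    from collections import Counter
    from typing import List, Tuple

    Sample = Tuple[str, str]  # (utterance, intention label)

    def split_into_groups(data: List[Sample], k: int) -> List[List[Sample]]:
        """Divide the full corpus data set into k subsets (group No. 1 .. group No. k)."""
        return [data[i::k] for i in range(k)]

    def train_majority_model(train: List[Sample]) -> str:
        """Stand-in 'training': remember the most frequent intention label in the training set."""
        return Counter(label for _, label in train).most_common(1)[0][0]

    def leave_one_group_out(data: List[Sample], k: int) -> List[Tuple[str, List[Sample]]]:
        groups = split_into_groups(data, k)
        models = []
        for j in range(k):                           # minor round j+1: delete group No. j+1
            train = [s for g, grp in enumerate(groups) if g != j for s in grp]
            test = groups[j]                         # the deleted subset is the corpus test set
            models.append((train_majority_model(train), test))
        return models                                # k models, one per deleted group

    if __name__ == "__main__":
        data = [("turn on the light", "device_on"), ("switch off the lamp", "device_off"),
                ("lights on please", "device_on"), ("power down the fan", "device_off"),
                ("activate the heater", "device_on"), ("shut the tv off", "device_off")]
        for model, test in leave_one_group_out(data, k=3):
            print(model, [utt for utt, _ in test])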
  13. The computer device according to claim 12, wherein the obtaining, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data, comprises:
    judging whether the i-th piece of corpus data in the full corpus data set is used as training sample data of a user intention recognition model or as test sample data of a user intention recognition model, wherein i takes positive integer values in the range [1, N], and N equals the total number of pieces of corpus data in the full corpus data set;
    if the i-th piece of corpus data is used as training sample data of user intention recognition models, acquiring the first target user intention recognition model set corresponding to the i-th piece of corpus data when used as training sample data, and averaging the model accuracies of the first target user intention recognition models in the set, so as to obtain the first model average accuracy corresponding to the i-th piece of corpus data when used as training sample data;
    if the i-th piece of corpus data is used as test sample data of user intention recognition models, acquiring the second target user intention recognition model set corresponding to the i-th piece of corpus data when used as test sample data, and averaging the model accuracies of the second target user intention recognition models in the set, so as to obtain the second model average accuracy corresponding to the i-th piece of corpus data when used as test sample data;
    calculating the difference between the first model average accuracy corresponding to the i-th piece of corpus data when used as training sample data and the second model average accuracy corresponding to the i-th piece of corpus data when used as test sample data, so as to obtain the average accuracy difference corresponding to the i-th piece of corpus data.
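Assuming the per-model accuracies have already been measured, the difference this claim computes for each sample can be sketched as below; claims 14 and 15 follow the same pattern with sample recall and prediction accuracy substituted for model accuracy. The dictionaries, the example figures and the function name are assumptions made for the sketch.

    from statistics import mean
    from typing import Dict, List

    def average_accuracy_difference(
        acc_as_training: Dict[int, List[float]],  # sample i -> accuracies of models that trained on it
        acc_as_testing: Dict[int, List[float]],   # sample i -> accuracies of models that tested on it
    ) -> Dict[int, float]:
        diffs = {}
        for i in acc_as_training.keys() & acc_as_testing.keys():
            first = mean(acc_as_training[i])      # first model average accuracy
            second = mean(acc_as_testing[i])      # second model average accuracy
            diffs[i] = first - second             # average accuracy difference for sample i
        return diffs

    if __name__ == "__main__":
        acc_train = {1: [0.91, 0.89], 2: [0.80, 0.82]}
        acc_test = {1: [0.93], 2: [0.88]}
        print(average_accuracy_difference(acc_train, acc_test))
        # roughly {1: -0.03, 2: -0.07}, up to floating-point rounding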
  14. The computer device according to claim 13, wherein the obtaining, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data, comprises:
    judging whether the i-th piece of corpus data in the full corpus data set is used as training sample data of a user intention recognition model or as test sample data of a user intention recognition model;
    if the i-th piece of corpus data is used as training sample data of user intention recognition models, acquiring the third target user intention recognition model set corresponding to the i-th piece of corpus data when used as training sample data, and averaging the sample recalls of the third target user intention recognition models in the set, so as to obtain the first sample recall corresponding to the i-th piece of corpus data when used as training sample data;
    if the i-th piece of corpus data is used as test sample data of user intention recognition models, acquiring the fourth target user intention recognition model set corresponding to the i-th piece of corpus data when used as test sample data, and averaging the sample recalls of the fourth target user intention recognition models in the set, so as to obtain the second sample recall corresponding to the i-th piece of corpus data when used as test sample data;
    calculating the difference between the first sample recall corresponding to the i-th piece of corpus data when used as training sample data and the second sample recall corresponding to the i-th piece of corpus data when used as test sample data, so as to obtain the sample recall difference corresponding to the i-th piece of corpus data.
  15. The computer device according to claim 14, wherein the obtaining, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data, comprises:
    judging whether the i-th piece of corpus data in the full corpus data set is used as training sample data of a user intention recognition model or as test sample data of a user intention recognition model;
    if the i-th piece of corpus data is used as training sample data of user intention recognition models, acquiring the fifth target user intention recognition model set corresponding to the i-th piece of corpus data when used as training sample data, and averaging the prediction accuracies of the fifth target user intention recognition models in the set, so as to obtain the first prediction average accuracy corresponding to the i-th piece of corpus data when used as training sample data;
    if the i-th piece of corpus data is used as test sample data of user intention recognition models, acquiring the sixth target user intention recognition model set corresponding to the i-th piece of corpus data when used as test sample data, and averaging the prediction accuracies of the sixth target user intention recognition models in the set, so as to obtain the second prediction average accuracy corresponding to the i-th piece of corpus data when used as test sample data;
    calculating the difference between the first prediction average accuracy corresponding to the i-th piece of corpus data when used as training sample data and the second prediction average accuracy corresponding to the i-th piece of corpus data when used as test sample data, so as to obtain the prediction accuracy difference corresponding to the i-th piece of corpus data.
  16. The computer device according to claim 10, wherein the obtaining, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data, comprises:
    concatenating, in order, the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, so as to obtain the sample contribution triple corresponding to each piece of corpus data.
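As an illustration only, and assuming the three per-sample difference tables of claims 13 to 15 are available as dictionaries keyed by sample index, the ordered concatenation described here amounts to the following; the function name and data layout are assumptions of the sketch.

    from typing import Dict, Tuple

    def build_contribution_triples(
        acc_diff: Dict[int, float],
        recall_diff: Dict[int, float],
        pred_acc_diff: Dict[int, float],
    ) -> Dict[int, Tuple[float, float, float]]:
        # concatenate the three differences, in order, into one triple per sample
        return {i: (acc_diff[i], recall_diff[i], pred_acc_diff[i]) for i in acc_diff}

    # example: build_contribution_triples({1: -0.03}, {1: 0.01}, {1: -0.02}) -> {1: (-0.03, 0.01, -0.02)}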
  17. The computer device according to claim 10, wherein, after the judging whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, the steps further comprise:
    if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are not all negative, retaining that corpus data in the full corpus data set.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    acquiring a full corpus data set, wherein the full corpus data set comprises a plurality of pieces of corpus data;
    calling a preset total number of groups, so as to divide the full corpus data set into corpus data subsets of a corresponding number of groups according to the total number of groups;
    sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to a user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, wherein, after one of the corpus data subsets into which the full corpus data set is divided is deleted in each round, the deleted corpus data subset serves as a corpus test set, and each piece of corpus data in the deleted corpus data subset serves as test sample data;
    obtaining, for each piece of corpus data in the full corpus data set, the first model average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second model average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the average accuracy difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first sample recall corresponding to the corpus data when used as training sample data of the user intention recognition models and the second sample recall corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the sample recall difference corresponding to each piece of corpus data;
    obtaining, for each piece of corpus data in the full corpus data set, the first prediction average accuracy corresponding to the corpus data when used as training sample data of the user intention recognition models and the second prediction average accuracy corresponding to the corpus data when used as test sample data of the user intention recognition models, and calculating their difference, so as to obtain the prediction accuracy difference corresponding to each piece of corpus data;
    obtaining, according to the average accuracy difference, the sample recall difference and the prediction accuracy difference corresponding to each piece of corpus data, the sample contribution triple corresponding to each piece of corpus data;
    judging whether there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative;
    if there is corpus data whose sample contribution triple has an average accuracy difference, a sample recall difference and a prediction accuracy difference that are all negative, obtaining the corresponding target corpus data to form a corpus data set to be deleted; and
    deleting the corpus data set to be deleted from the full corpus data set, so as to update the full corpus data set.
  19. The computer-readable storage medium according to claim 18, wherein, after the deleting the corpus data set to be deleted from the full corpus data set so as to update the full corpus data set, the operations further comprise:
    acquiring a current iteration count and adding one to the current iteration count to update the current iteration count, wherein the initial value of the current iteration count is 0;
    judging whether the current iteration count exceeds a preset maximum iteration count;
    if the current iteration count does not exceed the preset maximum iteration count, calling a preset total number of supplementary corpus data entries, and randomly extracting, from a local corpus pool, supplementary corpus data whose total number of entries is the same as the total number of supplementary corpus data entries, so as to form a supplementary corpus data set;
    adding the supplementary corpus data set to the full corpus data set to update the full corpus data set, and returning to the operation of acquiring the full corpus data set;
    if the current iteration count exceeds the preset maximum iteration count, ending the process.
  20. The computer-readable storage medium according to claim 18, wherein the sequentially deleting one of the corpus data subsets into which the full corpus data set is divided and inputting the remaining subsets to the user intention recognition model to be trained, so as to obtain the same number of user intention recognition models as the total number of groups, comprises:
    denoting the full corpus data set as a data set X, and denoting the corpus data subsets into which the data set X is divided as corpus data subset No. 1 to corpus data subset No. k, a corpus data subset between corpus data subset No. 1 and corpus data subset No. k being denoted as corpus data subset No. j, wherein the value of k equals the total number of groups, and j takes positive integer values in the interval [1, k];
    deleting corpus data subset No. 1 from the full corpus data set, and performing training with the remaining corpus data subsets in the full corpus data set as the training set of the user intention recognition model to be trained, so as to obtain the user intention recognition model of the first major round, first minor round;
    sequentially deleting corpus data subset No. 2 to corpus data subset No. k from the full corpus data set and, in each case, performing training with the remaining corpus data subsets as the training set of the user intention recognition model to be trained, so as to sequentially obtain the user intention recognition models of the first major round, second minor round to the first major round, k-th minor round.
PCT/CN2020/122842 2020-08-05 2020-10-22 Data feature enhancement method and apparatus for corpus data, computer device, and storage medium WO2021139317A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010777836.8A CN111914936B (en) 2020-08-05 2020-08-05 Data characteristic enhancement method and device for corpus data and computer equipment
CN202010777836.8 2020-08-05

Publications (1)

Publication Number Publication Date
WO2021139317A1 true WO2021139317A1 (en) 2021-07-15

Family

ID=73287205

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122842 WO2021139317A1 (en) 2020-08-05 2020-10-22 Data feature enhancement method and apparatus for corpus data, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111914936B (en)
WO (1) WO2021139317A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634863B (en) * 2020-12-09 2024-02-09 深圳市优必选科技股份有限公司 Training method and device of speech synthesis model, electronic equipment and medium
CN112598326A (en) * 2020-12-31 2021-04-02 五八有限公司 Model iteration method and device, electronic equipment and storage medium
CN113111977B (en) * 2021-05-20 2021-11-09 润联软件系统(深圳)有限公司 Method and device for evaluating contribution degree of training sample and related equipment
CN113806485B (en) * 2021-09-23 2023-06-23 厦门快商通科技股份有限公司 Intention recognition method and device based on small sample cold start and readable medium
CN115098771A (en) * 2022-06-09 2022-09-23 阿里巴巴(中国)有限公司 Recommendation model updating method, recommendation model training method and computing device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344395B (en) * 2018-08-30 2022-05-20 腾讯科技(深圳)有限公司 Data processing method, device, server and storage medium
CN110458207A (en) * 2019-07-24 2019-11-15 厦门快商通科技股份有限公司 A kind of corpus Intention Anticipation method, corpus labeling method and electronic equipment
CN111274797A (en) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 Intention recognition method, device and equipment for terminal and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1940915A (en) * 2005-09-29 2007-04-04 国际商业机器公司 Corpus expansion system and method
CN104951469A (en) * 2014-03-28 2015-09-30 株式会社东芝 Method and device for optimizing corpus
US20160026634A1 (en) * 2014-07-28 2016-01-28 International Business Machines Corporation Corpus Quality Analysis
CN110134799A (en) * 2019-05-29 2019-08-16 四川长虹电器股份有限公司 A kind of text corpus based on BM25 algorithm build and optimization method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117411969A (en) * 2023-12-14 2024-01-16 致讯科技(天津)有限公司 User perception evaluation method and device for non-target material
CN117411969B (en) * 2023-12-14 2024-03-12 致讯科技(天津)有限公司 User perception evaluation method and device for non-target material

Also Published As

Publication number Publication date
CN111914936A (en) 2020-11-10
CN111914936B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
WO2021139317A1 (en) Data feature enhancement method and apparatus for corpus data, computer device, and storage medium
WO2021135477A1 (en) Probabilistic graphical model-based text attribute extraction method and apparatus, computer device and storage medium
CN109376873B (en) Operation and maintenance method, operation and maintenance device, electronic equipment and computer readable storage medium
CN113536081B (en) Data center data management method and system based on artificial intelligence
WO2021027153A1 (en) Method and apparatus for constructing traffic flow data analysis model
CN106302843B (en) A kind of IP address library update method and device
CN110647447B (en) Abnormal instance detection method, device, equipment and medium for distributed system
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
WO2022246843A1 (en) Software project risk assessment method and apparatus, computer device, and storage medium
CN106909656A (en) Obtain the method and device of Text Feature Extraction model
CN109783459A (en) The method, apparatus and computer readable storage medium of data are extracted from log
CN116089870A (en) Industrial equipment fault prediction method and device based on meta-learning under small sample condition
US20220243347A1 (en) Determination method and determination apparatus for conversion efficiency of hydrogen production by wind-solar hybrid electrolysis of water
CN111310918A (en) Data processing method and device, computer equipment and storage medium
CN114580915B (en) Intelligent evaluation method and system for hair planting effect of novel microneedle technology
CN112257215B (en) Maximum likelihood estimation solving method and system for product life distribution parameters
CN114238106A (en) Test time prediction method and device, electronic device and storage medium
CN107071553A (en) A kind of method, device and computer-readable recording medium for changing video speech
CN113628077A (en) Method for generating non-repeated examination questions, terminal and readable storage medium
CN111611279A (en) Microwave assembly fault diagnosis system and method based on test index similarity
CN109993313A (en) Sample label processing method and processing device, community partitioning method and device
CN117131784B (en) Global agent optimization-based accelerated degradation test design method and device
CN111080394B (en) Matching method, device and storage medium
CN110619047B (en) Method and device for constructing natural language model and readable storage medium
WO2021051615A1 (en) Response method and apparatus based on artificial intelligence, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911642

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911642

Country of ref document: EP

Kind code of ref document: A1