CN115827863A - Method for processing training data, method for training model and medium - Google Patents


Info

Publication number
CN115827863A
CN115827863A
Authority
CN
China
Prior art keywords
data
verification
data set
ratio
label
Prior art date
Legal status
Pending
Application number
CN202211538320.3A
Other languages
Chinese (zh)
Inventor
黄熙宇
姚贡之
Current Assignee
Shanghai Hongji Information Technology Co Ltd
Original Assignee
Shanghai Hongji Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Hongji Information Technology Co Ltd
Priority to CN202211538320.3A
Publication of CN115827863A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the application provides a method for processing training data, a method for training a model, and a medium. The method includes: screening a verification set out of a data set according to a first principle, the first principle at least ensuring that the label categories of the verification set match the label categories of the data set as closely as possible; and taking the data in the data set other than the verification set, together with their labels, as a training set. The method can effectively solve the technical problems that multi-label data sets in natural language processing are insufficient, differ greatly, and are difficult to standardize, and improves the quality of the obtained training data, so that a model (for example, a natural language processing model) trained on that data performs better.

Description

Method for processing training data, method for training model and medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method for processing training data, a method for training a model, and a medium.
Background
When a natural language processing system is actually deployed, it often faces multi-label data, that is, input data each of which carries several labels of the same type. For example, portal-site news can be assigned to different columns (such as politics, sports, society, science and finance); an article that covers both political and financial content may therefore carry both a politics label and a finance label.
A multi-task natural language processing system requires a standardized multi-label data set during training. The data must be split, in given proportions, into a standardized training set and a standardized verification set, and each subset should contain as many of the labels of the different tasks as possible while keeping the label proportions balanced. An unbalanced split affects training or evaluation: if the training set lacks some labels, the system becomes insensitive to the missing labels; if the verification set lacks some labels, the verification process cannot correctly evaluate the system's actual performance. A standardized split prevents the system from being too sensitive or too insensitive to the data of a particular label, and avoids decision errors caused by data problems.
Standardization of an ordinary single-label data set usually splits the training set and the verification set randomly and directly by proportion, but in practice a multi-label data set is difficult to split randomly. Specifically, the random approach has three drawbacks: 1) the label proportions of the resulting subsets differ greatly from the original proportions; 2) a subset may fail to contain all rare labels, i.e., its label categories are insufficient; 3) random splitting uses the data inefficiently, so the existing data cannot be fully exploited to train a comparatively robust system model.
Disclosure of Invention
Embodiments of the present application provide a method for processing training data, a method for training a model, and a medium, which can effectively solve the technical problems that multi-label data sets in natural language processing are insufficient, differ greatly, and are difficult to standardize, and which improve the quality of the obtained training data, so that a model (e.g., a natural language processing model) trained on that data performs better.
In a first aspect, an embodiment of the present application provides a method for processing training data, where the method includes: screening a verification set from a data set according to a first principle, the first principle being at least used to ensure that the label categories of the verification set are as close as possible to the label categories of the data set; and taking the data in the data set other than the verification set, together with their labels, as a training set.
According to some embodiments of the application, the verification set is constructed first and the training set second; when the verification set is split off, the reasonableness of its label categories is guaranteed, which in turn guarantees that the whole data set is split reasonably. This standardization method effectively solves the problem of unreasonable labels when a multi-label data set is split, and makes full use of the whole data set, so that a comparatively robust system model can be trained from the existing data.
In some embodiments, the first principle is further configured to ensure that the difference between a first ratio and a second ratio is smaller than a target threshold, where the first ratio is the ratio of the total number of data included in the verification set to the total number of data included in the data set, and the second ratio is the ratio of the total number of labels included in the verification set to the total number of labels included in the data set; alternatively, the first ratio is the ratio of the total number of data included in the verification set to its total number of labels, and the second ratio is the ratio of the total number of data included in the data set to its total number of labels.
In some embodiments of the present application, the first principle further requires the label proportions to be as close as possible; when the verification set is split off, the reasonableness of both its label categories and its label proportions is guaranteed, which in turn guarantees that the whole data set is split reasonably.
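As a minimal sketch of the ratio comparison described above (function and variable names are hypothetical; the application does not give an implementation), the first variant of the first principle can be written as:

```python
def ratio_gap(val_data, set_data, val_tags, set_tags):
    """Gap between the data-count ratio (first ratio) and the
    tag-count ratio (second ratio) of the verification set
    relative to the whole data set."""
    first_ratio = val_data / set_data   # data in verification set / data in set
    second_ratio = val_tags / set_tags  # labels in verification set / labels in set
    return abs(first_ratio - second_ratio)

# e.g. 100 of 1000 data, but 230 of 2400 label occurrences
assert ratio_gap(100, 1000, 230, 2400) < 0.01  # within a target threshold of 0.01
```

Under the first principle, a candidate verification set would be accepted only if this gap stays below the chosen target threshold.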
In some embodiments, the data set includes a plurality of data and a plurality of tags corresponding to each data; the screening of the verification set from the data set comprises: obtaining various different labels included in the data set to obtain a first label set; screening out at least one data corresponding to each label in the first label set from the data set to obtain a basic verification data set; and obtaining the verification set according to the basic verification data set.
Some embodiments of the application first collect the various distinct labels occurring in the data set, then extract one or more data from the data set for each of those labels and use the extracted data as a basic verification data set, so that the verification set obtained from the basic verification data set contains data corresponding to every distinct label in the data set, which improves the reasonableness of the split.
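A sketch of this per-label extraction, assuming each datum is represented as a `(data_id, label_set)` pair (the representation and names are illustrative assumptions, not the application's):

```python
def build_basic_verification_set(dataset):
    """For every distinct label in the data set, pick at least one
    datum carrying that label; the dict keys de-duplicate data that
    were picked for several labels."""
    first_label_set = set().union(*(labels for _, labels in dataset))
    basic = {}
    for label in sorted(first_label_set):
        for data_id, labels in dataset:
            if label in labels:
                basic[data_id] = labels
                break
    return basic, first_label_set

data = [("d1", {"politics", "finance"}), ("d2", {"sports"}),
        ("d3", {"finance"}), ("d4", {"science", "politics"})]
basic, first_label_set = build_basic_verification_set(data)
# `basic` now covers every label in `first_label_set`
```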
In some embodiments, said deriving said verification set from said basic verification data set comprises: if the total data volume in the basic verification data set is confirmed to be smaller than the target total data volume, taking partial data in a first data set as data of the basic verification data set to obtain a first verification data set, wherein the first data set is data which are included in the data set and are left except the basic verification data set; and obtaining the verification set according to the first verification data set.
Some embodiments of the application determine whether the total data amount of the obtained basic verification data set meets the required verification-set data volume, which improves the quality of the obtained verification set and, further, allows the model obtained after training on the training set to be evaluated better against the verification set.
In some embodiments, said deriving said verification set from said basic verification data set comprises: screening the data in the basic verification data set if the total data amount in the basic verification data set is determined to be larger than the target total data amount, so as to obtain the first verification data set; and obtaining the verification set according to the first verification data set.
Some embodiments of the present application screen data in the basic verification data set to obtain verification set data when the data amount in the basic verification data set is greater than the target total data amount, so that the data amount in the verification set meets the requirement.
In some embodiments, the screening the data in the basic verification data set includes: selecting at least m target data from the basic verification data set as a second verification data set, wherein the total number of tag categories corresponding to the m target data is greater than or equal to the total number of tag categories corresponding to any m data combinations remaining in the basic verification data set; and repeating the above processes until the total number of the data of the second verification data set is equal to the target total data amount, so as to obtain the first verification data set.
Some embodiments of the present application screen data in the basic verification data set under the condition that the tag category is guaranteed to be as consistent as possible with the basic verification data set or the tag category of the data set (for example, the ratio of the total amount of data in the corresponding data set to the total amount of different tags), so as to obtain verification set data, and thus the tag proportion of the verification data set is more reasonable.
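The combinatorial selection described above (choose m data whose combined label-category count is at least that of any other m-combination) can be approximated greedily; the following sketch repeatedly keeps the datum that covers the most not-yet-covered label categories (a hypothetical heuristic, not necessarily the exact claimed procedure):

```python
def greedy_downselect(basic_set, target_size):
    """basic_set: dict mapping data_id -> set of labels.
    Keep data until `target_size` is reached, always choosing the
    datum that adds the most new label categories (ties broken by
    total label count)."""
    remaining = dict(basic_set)
    selected, covered = {}, set()
    while remaining and len(selected) < target_size:
        best = max(remaining,
                   key=lambda d: (len(remaining[d] - covered), len(remaining[d])))
        selected[best] = remaining.pop(best)
        covered |= selected[best]
    return selected

basic = {"d1": {"politics", "finance"}, "d2": {"sports"},
         "d4": {"science", "politics"}}
kept = greedy_downselect(basic, target_size=2)
```

With the sample above, the two kept data jointly cover politics, finance and science, which no other two-datum combination exceeds.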
In some embodiments, said obtaining the verification set from the first verification data set comprises: if it is determined that a label proportion is larger than a first set threshold, or that the label proportion is smaller than a second set threshold, correcting part of the data in the first verification data set to obtain the verification set; wherein the label proportion is the ratio between a third ratio and a fourth ratio, the third ratio being the ratio of the total number of labels included in the first verification data set to the total number of labels included in the data set, and the fourth ratio being the ratio of the total number of data included in the first verification data set to the total number of data included in the data set.
In some embodiments of the present application, once the data amount of the verification set is determined, if its label proportion is too large or too small compared with that of the data set, the number of labels in the verification set needs to be adjusted, which improves the quality of the obtained verification set data.
In some embodiments, said correcting of part of the data in the first verification data set when the label proportion is smaller than the second set threshold comprises: replacing first data in the first verification data set with second data, wherein the second data belongs to the data included in the data set other than the data in the first verification data set, the label categories of the second data include the label categories of the first data, and the total number of labels corresponding to the first data is less than the total number of labels corresponding to the second data; and obtaining the verification set by performing the replacement one or more times, until the label proportion of all data included in the first verification data set after all replacements is greater than or equal to the second set threshold.
When the label proportion of the verification set is determined to be too small, some embodiments of the present application replace data in the verification set with remaining data carrying more labels, so as to satisfy the first principle as far as possible and improve the reasonableness of the training data.
In some embodiments, the total number of tags corresponding to the second data is the most number of tags in the remaining data included in the data set other than the data in the first verification data set.
Some embodiments of the present application select the data with the largest number of tags to replace the data in the first verification data set in order to reduce the number of replacements when it is confirmed that the proportion of tags is too small.
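A sketch of the replacement loop for the too-small case (names are illustrative; the additional requirement that the incoming datum's label categories contain the outgoing datum's is elided here for brevity):

```python
def raise_tag_ratio(verification, pool, second_threshold):
    """While the label proportion of `verification` is below the
    threshold, swap its fewest-label datum for the pool datum with
    the most labels (the choice the text says minimises the number
    of replacements). Both arguments map data_id -> set of labels,
    have disjoint keys, and are mutated in place."""
    full = {**verification, **pool}
    full_tags = sum(len(t) for t in full.values())
    full_n = len(full)

    def tag_ratio(v):
        third = sum(len(t) for t in v.values()) / full_tags   # third ratio
        fourth = len(v) / full_n                              # fourth ratio
        return third / fourth

    while pool and tag_ratio(verification) < second_threshold:
        new_id = max(pool, key=lambda d: len(pool[d]))
        old_id = min(verification, key=lambda d: len(verification[d]))
        if len(pool[new_id]) <= len(verification[old_id]):
            break  # no remaining swap can raise the ratio further
        verification[new_id] = pool.pop(new_id)
        pool[old_id] = verification.pop(old_id)
    return verification
```

For example, a one-datum verification set `{"a": {1}}` with pool `{"b": {1, 2, 3}, "c": {1, 2}}` and threshold 1.2 would swap `a` out for `b`. The symmetric too-large case would swap in the pool datum with the fewest labels instead.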
In some embodiments, said correcting of part of the data in the first verification data set when the label proportion is greater than the first set threshold comprises: replacing third data in the first verification data set with fourth data, wherein the fourth data belongs to the data included in the data set other than the data in the first verification data set, the label categories of the third data include the label categories of the fourth data, and the number of labels corresponding to the third data is greater than the number of labels corresponding to the fourth data; and obtaining the verification set by performing the replacement one or more times, until the label proportion of all data included in the first verification data set after all replacements is less than or equal to the first set threshold.
When the label proportion of the verification set is determined to be too large, some embodiments of the present application replace data in the verification set with remaining data carrying fewer labels, so as to satisfy the first principle as far as possible and improve the reasonableness of the training data.
In some embodiments, the number of tags corresponding to the fourth data is the least number of tags in the remaining data included in the data set other than the data in the first set of validation data.
Some embodiments of the application may reduce the number of replacements by selecting the data with the least number of tags to replace the data in the first verification data set when the verification set is determined to have an excessively large tag ratio.
In some embodiments, prior to said screening out of the verification set from the data set, the method further comprises: obtaining the tags included in an original data set whose occurrence frequency is smaller than a set threshold, to obtain a rare tag set; resampling the data corresponding to the rare tag set to obtain a sampling result; and constructing the data set from the sampling result.
Some embodiments of the present application resample the rare tag data to avoid losing data corresponding to these tags.
In a second aspect, some embodiments of the present application provide a method for training a natural language processing model, wherein a target natural language processing model is obtained by training and verifying the natural language processing model according to the verification set and the training set obtained as any of the embodiments of the first aspect.
In a third aspect, some embodiments of the present application provide a method for processing text, in which a file to be processed is classified by the target natural language model obtained with the method provided in the second aspect, so as to obtain a classification result.
In some embodiments, after said obtaining the classification result, the method further comprises: and providing a retrieval result for the user or recommending a target file to the user according to the classification result.
In a fourth aspect, some embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, may implement the method as described in any of the first, second and third aspects above.
In a fifth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, may implement the method according to the first, second, and third aspects described above.
In a sixth aspect, some embodiments of the present application provide an apparatus for acquiring training data, the apparatus comprising: the verification set screening module is configured to screen a verification set from a data set, wherein the verification set is obtained by screening data in the data set according to a first principle, and the first principle is at least used for ensuring that a label category corresponding to the verification set is as close as possible to a label category corresponding to the data set; a training set acquisition module configured to take the labels and data in the data set except the verification set as a training set.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can also obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of training a natural language processing model according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for processing training data according to an embodiment of the present disclosure;
FIG. 3 is a second flowchart of a method for processing training data according to an embodiment of the present application;
FIG. 4 is a block diagram illustrating an apparatus for processing annotation data according to an embodiment of the present disclosure;
fig. 5 is a schematic composition diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Some embodiments of the application aim at least to solve the problem of unbalanced label division of a data set. When a multi-label data set is standardized, the verification set is split off first and the training set second; when the verification set is split off, the reasonableness of its label categories and label proportions is guaranteed, which in turn guarantees that the whole data set is split reasonably. This standardization method effectively solves the problem of unreasonable labels when a multi-label data set is split, and makes full use of the whole data set, so that a comparatively robust system model can be trained from the existing data. For example, in some embodiments, while sufficient label categories are guaranteed (for example, the label categories in the basic verification data set are the same as those in the data set), the screening of the verification set keeps the label proportions of the training set and the verification set as close as possible to those of the whole data set, so that the standardization of the multi-label data set is reasonable. As a further example, by resampling the data corresponding to rare labels and by correcting the data in the first verification data set, some embodiments ensure that the training set and the verification set miss as few labels as possible.
It will be appreciated that the multi-label data set standardization approach provided by some embodiments of the present application makes full use of the existing data when partitioning the data set, so that a comparatively robust system model can be trained.
Referring to fig. 1, fig. 1 is a schematic diagram of a target natural language processing model 140 obtained by training a natural language processing model 130 according to data of a training set 110 and a verification set 120 according to an embodiment of the present application, and in a training process, parameters of the model are adjusted according to a prediction result output by the natural language processing model in fig. 1, so as to obtain the target natural language processing model 140.
As stated in the background section, a multi-task natural language processing system requires a standardized multi-label data set during training. The data must be split, in given proportions, into a standardized training set and a standardized verification set, and each subset should contain as many of the labels of the different tasks as possible while keeping the label proportions balanced. An unbalanced split affects training or evaluation: incomplete labels in the training set 110 make the system insensitive to the missing labels, and incomplete labels in the verification set 120 prevent the verification process from correctly evaluating the system's actual performance.
A standardized split prevents the system from being too sensitive or too insensitive to the data of a particular label and avoids decision errors caused by data problems. In the related art, standardization of a single-label data set generally splits the training set and the verification set randomly and directly by proportion, but for data each of which carries multiple labels this random split brings the drawbacks pointed out in the background. Some embodiments of the application can at least effectively solve the problems that multi-label data sets in natural language processing are insufficient, differ greatly, and are difficult to standardize.
The following describes, with reference to fig. 2, an exemplary process of the present application for dividing a data set into a training set 110 and a verification set 120, thereby completing data set standardization.
As shown in fig. 2, an embodiment of the present application provides a method for processing training data, which includes: S101, screening a verification set from a data set; and S102, taking the data in the data set other than the verification set, together with their labels, as a training set.
It should be noted that, in some embodiments of the present application, the verification set referred to in S101 is obtained by screening the data in the data set according to a first principle. In some embodiments, the first principle is at least used to ensure that the label categories of the verification set are as close as possible to the label categories of the data set, where "as close as possible" means that both the total number of distinct label types and the specific identity of each distinct label are as similar as possible. In other embodiments, the first principle is at least used to ensure that the total number of label categories of the verification set is as close as possible to that of the data set, and further to ensure that the difference between a first ratio and a second ratio is less than a target threshold, where the first ratio is the ratio of the total number of data included in the verification set to the total number of data included in the data set, and the second ratio is the ratio of the total number of labels included in the verification set to the total number of labels included in the data set; alternatively, the first ratio is the ratio of the total number of data included in the verification set to its total number of labels, and the second ratio is the ratio of the total number of data included in the data set to its total number of labels.
According to some embodiments of the application, the verification set is constructed first according to the first principle, and the data in the data set other than the verification set, together with their labels, are then used as the training set; when the verification set is split off, the reasonableness of its label categories and label proportions is guaranteed, which in turn guarantees that the whole data set is split reasonably. This standardization method effectively solves the problem of unreasonable labels when a multi-label data set is split, and makes full use of the whole data set, so that a comparatively robust system model can be trained from the existing data.
The implementation of the above S101 is exemplarily set forth below with reference to fig. 3.
S201, performing label statistics on all data in the original data set, counting rare labels, and resampling the data of the rare labels to construct a data set.
That is, in some embodiments of the present application, before the screening out the validation set from the dataset at S101, the method further comprises: obtaining tags with the occurrence frequency smaller than a set threshold value, which are included in an original data set, to obtain a rare tag set; resampling data corresponding to the rare label set to obtain a sampling result; and constructing the data set according to the sampling result. Some embodiments of the present application resample the rare tag data to avoid losing data corresponding to these tags.
For example, label statistics are performed on the labels corresponding to all data in the original data set, rare labels are identified (labels whose count, or whose proportion of all labels, is too small), and the data of the rare labels are resampled; a new multi-label data set is then constructed, using a sampling-enhancement algorithm including, but not limited to, the SMOTE (Synthetic Minority Oversampling Technique) algorithm, to obtain the data set.
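As a much simpler stand-in for SMOTE-style enhancement, plain duplication-based oversampling of rare-label data can be sketched as follows (the threshold and data representation are illustrative assumptions):

```python
from collections import Counter

def resample_rare_labels(dataset, min_count):
    """dataset: list of (data_id, label_set). Duplicate data whose
    labels occur fewer than `min_count` times — a naive form of the
    resampling step; real systems might synthesise new samples
    (SMOTE-like) instead of duplicating."""
    counts = Counter(label for _, labels in dataset for label in labels)
    rare = {label for label, c in counts.items() if c < min_count}
    augmented = list(dataset)
    for item in dataset:
        _, labels = item
        if labels & rare:
            # duplicate enough copies to lift the scarcest rare label
            deficit = max(min_count - counts[l] for l in labels & rare)
            augmented.extend([item] * deficit)
    return augmented
```

This guarantees rare labels survive a later proportional split, at the cost of some over-duplication when several rare data share a label.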
S202, determining the division ratio of the training set and the verification set in the data set to obtain the target total data volume of the verification set.
For example, if the ratio of the data amounts of the training set and the verification set is 9:1 and the data set contains 1000 data, the target total data amount of the verification set is 1000 × 0.1 = 100; that is, the finally obtained verification set contains 100 data. If one text is taken as one datum, the verification set contains 100 collected texts together with all the labels corresponding to those texts.
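The arithmetic of S202 is straightforward; a minimal helper (names illustrative) might look like:

```python
def target_verification_size(total, train_parts=9, val_parts=1):
    """Target total data amount of the verification set for a given
    train:verification split ratio (the text's example is 9:1)."""
    return round(total * val_parts / (train_parts + val_parts))

assert target_verification_size(1000) == 100  # the example in the text
```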
S203, for each label in the data set, respectively finding one or more pieces of multi-label data from the data set, and adding the data into the basic verification data set.
That is, in some embodiments of the present application, the data set includes a plurality of data and a plurality of tags corresponding to each data, and the step S101 of filtering out the verification set from the data set exemplarily includes:
the method comprises the following steps of firstly, obtaining various different labels included in the data set to obtain a first label set.
And secondly, screening out at least one data corresponding to each label in the first label set from the data set to obtain a basic verification data set.
And thirdly, obtaining the verification set from the basic verification data set. For example, in some embodiments of the present application, the third step directly takes the first label set and the basic verification data set as the verification set (i.e., the steps after S203 shown in fig. 3 are not performed, and the obtained basic verification data set is directly used as the final verification set to train the model). In other embodiments, it must further be considered whether the data amount of the basic verification data set satisfies the data amount determined in S202; in those embodiments, data are added to or deleted from the basic verification data set (i.e., one of the subsequent steps S205 or S206 is performed) to obtain the verification set used to verify the model.
By executing S203 some embodiments of the present application, various different tags are first screened from the data set, and one or more data are respectively extracted from the data set for the various different tags, so that the extracted data are used as data in the basic verification data set, so that the verification set obtained according to the basic verification data set includes data corresponding to the various different tags in the data set, thereby improving the rationality of data set partitioning, and satisfying the requirements of the first principle.
And S204, judging whether the data amount in the basic verification data set is larger than the target total data amount; if so, continuing to execute S205, and if not, continuing to execute S206.
For example, all data in the data set are first randomly scattered (random shuffle); then, for each label, one multi-label datum is found and added to the basic verification data set, and the basic verification data set is de-duplicated. It is then judged whether the data amount of the de-duplicated basic verification data set meets the target total data amount obtained in S202: if it exceeds the target total data amount, data need to be deleted from the basic verification data set, that is, S205 is executed; if it does not exceed the target total data amount, data need to be supplemented, that is, S206 is executed.
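The S203/S204 pipeline just described — shuffle, pick one datum per label, de-duplicate, then decide whether to trim or top up — can be sketched as follows (names and branch labels are illustrative):

```python
import random

def basic_set_then_branch(dataset, target_total, seed=0):
    """dataset: list of (data_id, label_set). Returns the de-duplicated
    basic verification data set and which branch follows: "trim" (S205)
    when the set is larger than the target, otherwise "top_up" (S206)."""
    data = list(dataset)
    random.Random(seed).shuffle(data)  # randomly scatter all data
    basic = {}
    for label in {l for _, labels in dataset for l in labels}:
        for data_id, labels in data:
            if label in labels:
                basic[data_id] = labels  # dict keys de-duplicate
                break
    action = "trim" if len(basic) > target_total else "top_up"
    return basic, action
```

Whatever the shuffle order, the resulting basic set covers every label exactly as S203 requires; only which representative datum is picked per label varies.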
S205, deleting part of the data in the basic verification data set to obtain a first verification data set. It should be noted that, in some embodiments of the present application, the data in the obtained first verification data set and the tags corresponding to the data may also be used as a final verification set, and the model is verified according to the verification set.
That is, in some embodiments of the present application, the obtaining the verification set according to the basic verification data set exemplarily includes: screening the data in the basic verification data set if the total data amount in the basic verification data set is determined to be larger than the target total data amount, so as to obtain the first verification data set; and obtaining the verification set according to the first verification data set. For example, in some embodiments of the present application, the screening of the data in the basic verification data set exemplarily includes: selecting at least m target data from the basic verification data set as a second verification data set, wherein the total number of tag categories (i.e., the total number of types of different tags) corresponding to the m target data is greater than or equal to the total number of tag categories corresponding to any combination of m data remaining in the basic verification data set; and repeating the above process until the total number of data in the second verification data set is equal to the target total data amount, so as to obtain the first verification data set. Some embodiments of the present application screen the data in the basic verification data set while keeping its tag categories as consistent as possible with those of the basic verification data set or of the data set (for example, keeping the ratio between the total amount of data and the total number of different tags close to that of the data set), so as to obtain verification set data whose tag proportion is more reasonable.
For example, when the data amount of the basic verification data set exceeds the data amount required for the verification set (i.e., the target total data amount), the basic verification data set needs to be pruned. The pruning can use a beam search algorithm: first, K1 data are selected from the basic verification data set so that the selected data cover the most tag categories; then another K1 data are added so that the combined selection still covers the most tag categories; and so on, until a verification set of the required data amount, i.e., the first verification data set, is obtained. The beam search algorithm mainly aims to make the verification set contain tags of all categories, but actual data cannot always guarantee this (the size of the verification set is fixed, while the number of tag types may be so large that no selection of data can cover every type); in that case the verification set can only cover as many tag categories as possible. At each step, data are selected from the basic verification data set such that, combined with the previously selected data, the number of covered tag categories is maximized; K1 data are selected per step until the data amount required for the verification set is reached. For example: assume the basic verification data set includes 50 data but the verification set size (i.e., the target total data amount) is 30; the basic verification data set then exceeds the requirement, and 30 data need to be selected from the 50.
The selection uses beam search: assuming K1 is 5, 5 data are picked each time until 30 data are reached. Beam search ensures that, after each pick of 5 data, the data selected so far cover the most tag categories.
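One possible reading of this pruning step can be approximated greedily in Python. This is a sketch under the assumption that data are (datum, label-set) pairs; a full beam search would keep several candidate sets per step, whereas this simplification keeps only the single best pick:

```python
def prune_by_label_coverage(basic_set, target_total, k1=5):
    """Select `target_total` data from `basic_set`, K1 at a time,
    so that the selected data cover as many tag categories as possible."""
    selected, covered = [], set()
    remaining = list(basic_set)
    while len(selected) < target_total and remaining:
        step = min(k1, target_total - len(selected))
        for _ in range(step):
            if not remaining:
                break
            # pick the datum contributing the most uncovered tag categories
            best = max(remaining, key=lambda item: len(item[1] - covered))
            remaining.remove(best)
            selected.append(best)
            covered |= best[1]
    return selected
```

As the text notes, once every reachable tag category is covered, further picks add no new categories; the greedy rule then simply fills the remaining slots.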
It should be noted that, in some embodiments of the present application, data in the first verification data set obtained after performing S205 and the labels corresponding to the data may be used as a final verification set, and the trained model is verified by using the verification set. In some embodiments of the present application, data in the first verification data set after the processing in S205 needs to be reprocessed to obtain final verification set data, that is, S207 in fig. 3 continues to be executed, and the verification set obtained after S207 is executed serves as the final verification set to verify the model.
S206, complementing the basic verification data set to obtain a first verification data set. It should be noted that, in some embodiments of the present application, the data in the first verification data set and the tags corresponding to the data may also be used as a final verification set, and the model is verified according to the verification set.
In some embodiments of the present application, the obtaining the verification set according to the basic verification data set exemplarily includes: if the total data volume in the basic verification data set is confirmed to be smaller than the target total data volume, taking partial data in a first data set as data of the basic verification data set to obtain a first verification data set, wherein the first data set is data which are included in the data set and are left except the basic verification data set; and obtaining the verification set according to the first verification data set.
Some embodiments of the present application increase data in the basic verification data set on the principle that the label proportion is reasonable, so that the data amount in the verification set meets the requirement.
For example, when it is determined that the data amount of the basic verification data set is less than the data amount required for the verification set, the basic verification data set needs to be complemented. A beam search algorithm is used for the completion: K2 data are selected from the remaining data excluding the basic verification data set (for example, randomly), then another K2 data are added, each time so that the tag category proportion of the growing set is closest to the tag category proportion of the whole data (i.e., the data set); and so on, until enough data have been obtained and added to the verification set. It should be noted that the data combination selected to be added to the basic verification data set is the one that brings the tag category proportion of the basic verification data set closest to that of the data set. K2 data are added to the basic verification data set each time until the total number reaches the required verification set size. For example, if the total data amount in the basic verification data set is 10 and the required verification set data amount is 50 (i.e., the target total amount is 50), 40 data need to be supplemented. If K2 is 6, 6 data are added to the basic verification data set at a time from the remaining data (basic verification data set + remaining data = whole data) using the beam search algorithm, until 50 data are obtained. The last step does not necessarily add K2 data, since only the total of 50 needs to be met; in this example the last step supplements 4 data (10 → 16 → 22 → 28 → 34 → 40 → 46 → 50).
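The completion step can be sketched in the same style — again a greedy simplification of the described beam search, with the (datum, label-set) representation and all names assumed purely for illustration:

```python
from collections import Counter

def label_distribution(data):
    """Fraction of each tag category among all tags in `data`."""
    counts = Counter(label for _, labels in data for label in labels)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

def ratio_distance(dist_a, dist_b):
    """L1 distance between two tag-category distributions."""
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in keys)

def complete_basic_set(basic_set, remaining, target_total, full_dist, k2=6):
    """Add data from `remaining` to `basic_set`, K2 at a time, keeping the
    running tag-category proportion closest to `full_dist` (the whole data)."""
    current, pool = list(basic_set), list(remaining)
    while len(current) < target_total and pool:
        # the last step may add fewer than K2, as in the 10 -> 50 example
        step = min(k2, target_total - len(current))
        for _ in range(step):
            if not pool:
                break
            best = min(pool, key=lambda item:
                       ratio_distance(label_distribution(current + [item]), full_dist))
            pool.remove(best)
            current.append(best)
    return current
```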
It should be noted that, in some embodiments of the present application, data in the first verification data set obtained after performing S206 and tags corresponding to the data may be used as a final verification set, and the trained model may be verified by using the verification set. In some embodiments of the present application, data in the first verification data set after the processing in S206 needs to be reprocessed to obtain final verification set data, that is, S207 in fig. 3 continues to be executed, and the verification set obtained after the execution of S207 serves as the final verification set to verify the model.
In some embodiments of the present application, for a first verification data set obtained previously, in the case of determining the number of verification set data, if there is a case that the ratio of the number of tags is too large or too small compared to the ratio of the tags of the entire data set, it is necessary to align the number of tags of the verification set. I.e. the operation of S207 needs to be performed on the first verification data set.
And S207, aligning the number of the labels in the first verification data set to obtain a verification set.
That is, in some embodiments of the present application, the obtaining the verification set according to the first verification data set includes: if it is confirmed that the label proportion is larger than a first set threshold, or if it is confirmed that the label proportion is smaller than a second set threshold, correcting part of the data in the first verification data set to obtain the verification set; wherein the label proportion is a ratio between a third ratio and a fourth ratio, the third ratio being a ratio of the total number of labels included in the first verification data set to the total number of labels included in the data set, and the fourth ratio being a ratio of the total number of data included in the first verification data set to the total number of data included in the data set. In some embodiments of the present application, when the data amount of the verification set has been determined, if the ratio of the number of labels is too large or too small compared with the label ratio of the data set, the number of labels in the verification set needs to be aligned, so as to improve the quality of the obtained verification set data.
The following illustrates how the data in the first verification data set are modified when the label proportion is too small.
In some embodiments of the present application, the modifying the partial data in the first verification data set if the tag proportion is determined to be smaller than the second set threshold (i.e. the tag proportion is too small) exemplarily comprises: replacing first data in the first verification data set with second data, wherein the second data belongs to data included in the data set other than data in the first verification data set, a tag category of the second data includes a tag category of the first data and a total number of tags corresponding to the first data is less than a total number of tags corresponding to the second data; and obtaining the verification set by executing one or more times of the replacement process until the label proportion of all data included in the first verification data set after all times of replacement is greater than or equal to the second set threshold value. It will be appreciated that the replacement data selected in each replacement is different. Some embodiments of the present application replace data in the validation set with data with more tags in the remaining data when the proportion of tags in the validation set is determined to be too small, so as to satisfy the first principle as much as possible and improve the reasonableness of the training data.
To reduce the number of replacements, in some embodiments of the present application, the total number of tags corresponding to the second data is the most of the remaining data included in the data set in addition to the data in the first verification data set. Some embodiments of the present application select the data with the largest number of tags to replace the data in the first verification data set in order to reduce the number of replacements when it is confirmed that the proportion of tags is too small.
It should be noted that in some embodiments of the present application, the label ratio is too small, which means that there is a large difference between the ratio of the number of labels in the first verification data set to the number of labels in the data set and the ratio of the number of data in the first verification data set to the data amount in the data set. Too large or too small here is generally a significant difference, for example an order of magnitude difference. For example, the data set has 100 data, and the first verification data set has 10 data, but the total number of tags corresponding to the data set is 1000, and the number of tags in the first verification data set is only 20. At this time, the ratio of the number of tags of the first verification data set to the number of tags of the data set is 20:1000, the ratio of the number of data in the first validation data set to the total number of data in the data set is 10:100, the difference is large, the label proportion is too small, and the order of magnitude difference exists.
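The label proportion defined above — the third ratio divided by the fourth ratio — is a single computation. A sketch using the figures from this paragraph (the function name is an illustrative assumption):

```python
def label_proportion(val_label_total, set_label_total,
                     val_data_total, set_data_total):
    """Ratio between the label-count ratio (third ratio) and the
    data-count ratio (fourth ratio) of a verification set."""
    third = val_label_total / set_label_total  # e.g. 20 / 1000
    fourth = val_data_total / set_data_total   # e.g. 10 / 100
    return third / fourth

too_small = label_proportion(20, 1000, 10, 100)   # about 0.2: far below 1, proportion too small
too_large = label_proportion(500, 1000, 10, 100)  # about 5: far above 1, proportion too large
```

A value near 1 means the verification set's share of labels matches its share of data; large deviations in either direction trigger the alignment of S207.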
For example, in some embodiments of the present application, if the label proportion is too small, the number of labels in the verification set needs to be increased. The first verification data set is traversed; for each multi-label datum in the first verification data set, a multi-label datum is sought in the remaining data (excluding the first verification data set) whose label categories include those of the datum to be replaced and whose number of labels is greater (the greatest available), so that the label categories contained in the verification set do not change before and after the replacement; if such a multi-label datum exists, the replacement is performed.
The following illustrates how to modify the data in the first verification data set when the label proportion is too large.
In some embodiments of the present application, the modifying the partial data in the first verification data set if it is confirmed that the label proportion is greater than the first set threshold exemplarily includes: replacing third data in the first verification data set with fourth data, wherein the fourth data belongs to data included in the data set except for data in the first verification data set, the label category of the third data includes the label category of the fourth data, and the number of labels corresponding to the third data is greater than the number of labels corresponding to the fourth data; and obtaining the verification set by executing the replacement process one or more times until the label proportion of all data included in the first verification data set after all replacements is less than or equal to the first set threshold. It will be appreciated that the replacement data selected in each replacement are different. Some embodiments of the present application, when it is determined that the proportion of labels in the verification set is too large, select data with a smaller number of labels from the remaining data to replace data in the verification set, so as to satisfy the first principle as much as possible and improve the reasonableness of the training data.
To reduce the number of replacements, in some embodiments of the present application, the number of tags corresponding to the fourth data is the least of the remaining data included in the data set other than the data in the first verification data set. Some embodiments of the application may reduce the number of replacements by selecting the data with the least number of tags to replace the data in the first verification data set when the verification set is determined to have an excessively large tag ratio.
The following example illustrates the meaning of an over-large label proportion. For example, the data set has 100 data and the first verification data set has 10 data; the total number of labels corresponding to the data set is 1000, while the number of labels corresponding to the first verification data set is 500. The ratio of the number of labels in the first verification data set to the number of labels in the data set is thus 500:1000, while the ratio of the total amount of data in the first verification data set to the total amount of data in the data set is 10:100; the difference between the two ratios is large, i.e., the label proportion is too large, the two ratios differing in magnitude. As can be seen from the above description, if the label proportion is too large, the number of labels in the first verification data set needs to be reduced. For example, the first verification data set is traversed; for each multi-label datum in the first verification data set, a multi-label datum is sought in the remaining data (excluding the first verification data set) whose label categories are included in those of the datum to be replaced and whose number of labels is smaller (the smallest available), so that the label categories contained in the verification set do not change before and after the replacement; if such a multi-label datum exists, the replacement is performed.
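One replacement pass for the over-large case might look as follows — a hedged sketch: the subset condition and the "keep all tag categories" constraint come from the description above, while the (datum, label-set) representation and all names are assumptions:

```python
def shrink_label_count(val_set, remaining):
    """Try one replacement: swap a validation datum for a remaining datum
    that carries fewer tags, without losing any tag category from `val_set`.
    Returns True if a replacement was performed."""
    for i, (datum, labels) in enumerate(val_set):
        # tag categories still covered by the rest of the validation set
        others = set().union(*(ls for j, (_, ls) in enumerate(val_set) if j != i))
        candidates = [(d, ls) for d, ls in remaining
                      if ls <= labels             # no tag category outside the replaced datum's
                      and len(ls) < len(labels)   # strictly fewer tags
                      and labels - ls <= others]  # dropped categories stay covered elsewhere
        if candidates:
            # the fewest-tag candidate minimises the number of passes needed
            repl = min(candidates, key=lambda item: len(item[1]))
            remaining.remove(repl)
            remaining.append((datum, labels))
            val_set[i] = repl
            return True
    return False
```

Repeating the pass until the label proportion falls to or below the first set threshold mirrors the "one or more replacements" wording above; the too-small case is symmetric, with the subset and count inequalities reversed.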
It should be noted that the division ratio rate between the training set and the validation set is a hyper-parameter. For example, the division ratio rate can be set to 9:1. The above-described method of iterative sampling solves the problem of normalizing multi-label datasets (i.e., it provides a way to process training data). By determining the verification set division step by step, the label categories and label proportions of the training set and the verification set are kept as close as possible to those of the whole data, ensuring that the data set is standardized reasonably; at the same time the data can be fully utilized, the capability of the model system is relatively improved, and the trained model is more robust.
The method for processing training data provided by the embodiment of the application is described below by taking a text as one data in a data set and combining two examples, and the method realizes the standardized processing of a universal multi-label data set.
Example 1
In a task of processing a type of text labels in natural language, different types of specific texts need to be labeled, generally, an article has multiple places to be labeled, and each labeled place needs to be labeled with a category label, so that an article (as an example of one data in a data set) contains multiple labels (i.e., one example of one data versus one multiple labels). Text annotation is a natural language processing task that requires multi-tag data.
When training a natural language processing model of text labels, the multi-labeled articles need to be divided into a training set and a verification set, and the number of the articles in each set is determined. Since each article contains a plurality of places to be labeled, each article also corresponds to a plurality of labels, and the number of labels contained in each article (as an example of one data in the verification set or the training set) is also greatly different. The number of articles containing a particular tag is also limited for a particular tagging task. If all articles are randomly assigned as in the related art, it is likely that the training set or the verification set will lack label categories, and the ratio of the articles in the training set and the verification set (i.e., an example of the ratio of the data in the training set to the data in the verification set) and the ratio of the labels (i.e., the ratio between the total number of labels corresponding to the training set and the total number of labels corresponding to the data in the verification set) are very different, and are very different from the expected division ratio.
Some embodiments of the present application address the problem of data set standardization in multi-label text labeling using the following approach.
Firstly, all tags in the data set are counted; if a tag has too few data, even only one datum, the data need to be resampled, and the SMOTE algorithm is used to construct more data for that tag. Secondly, the verification set is divided: according to the training-set-to-verification-set ratio of 9:1, one tenth of the data needs to be taken out as the verification set. The data are extracted in a shuffled order; one datum is required to be extracted for each tag, and only one copy is kept if duplicate data are extracted. If the extracted data exceed the amount required for the verification set, data are deleted: the required amount is selected from all the extracted data using a beam search algorithm while keeping the number of label categories at a maximum. If the extracted data are fewer than the amount required for the verification set, data are added: the beam search algorithm selects the required amount from the remaining data while keeping the label proportion closest to that of the whole.
The last step requires aligning the ratio of the verification set's data amount to its number of tags. Because each datum contains a different number of labels, if the data selected for the verification set contain many labels, the data counts of the training set and the verification set remain at 9:1, but the proportion of tags may be much greater than 9:1. In that case data need to be replaced: data in the verification set are exchanged, as far as possible without changing the label categories, so that the label proportion approaches the data proportion.
After the verification set is divided, the rest data is training set data, and the standardization method simultaneously ensures that the proportion of the two set labels is as close to the proportion of the whole data as possible, and the standardization is finished.
Example 2
Published articles often require text classification by natural language processing models for search and recommendation. The content of many articles contains many areas, so text classification requires that such articles be classified into many categories. In the face of the natural language processing model which needs to process multi-label texts, most of the trained data is a multi-label data set. Due to the requirements of text formats, text word numbers and the like, the proportion of various labels in the text classification data set of multi-label data is often very different, for example, the proportion of articles in the categories of society, science and technology, finance and the like is much larger than that of articles in the categories of religion and law. On the other hand, some special tags may be strongly associated, for example, a multi-tag article of portal news, which generally contains educational content, is likely to contain social content.
If the partitions of the training set and the verification set are randomly drawn, the label category and the label proportion are greatly different. This can result in the model being too sensitive in certain categories during training, reducing the performance of the model.
Some embodiments of the present application adopt an approach similar to the solution in Example 1: beam search is used to extract verification set data in a proper proportion, ensuring that neither the training set nor the verification set lacks any label category and that their proportions stay close to those of the whole data, thereby avoiding model prediction errors caused by data problems.
Referring to fig. 4, fig. 4 shows a device for acquiring training data according to an embodiment of the present application, and it should be understood that the device corresponds to the embodiment of the method in fig. 2, and is capable of performing various steps related to the embodiment of the method, and specific functions of the device may be referred to the description above, and detailed descriptions are appropriately omitted here to avoid redundancy. The device comprises at least one software functional module which can be stored in a memory in the form of software or firmware or solidified in an operating system of the device, and the device for acquiring training data comprises: a verification set screening module 301 and a training set acquisition module 302.
The verification set screening module is configured to screen a verification set from a data set, wherein the verification set is obtained by screening data in the data set according to a first principle, and the first principle is at least used for ensuring that a tag class corresponding to the verification set is as close as possible to a tag class corresponding to the data set.
A training set acquisition module configured to take the labels and data in the data set except the verification set as a training set.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
Some embodiments of the present application provide a method for training a natural language processing model, wherein a natural language processing model is trained and verified according to a verification set and a training set obtained according to any of the above embodiments, so as to obtain a target natural language processing model.
Some embodiments of the present application provide a method for processing a text, which classifies a text to be processed according to a target natural language processing model obtained by the above method, and obtains a classification result.
In some embodiments, after said obtaining the classification result, the method further comprises: and providing a retrieval result for the user or recommending a target file to the user according to the classification result.
Some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, may implement the method described in any of the method embodiments above.
As shown in fig. 5, some embodiments of the present application provide an electronic device 500, which includes a memory 510, a processor 520, and a computer program stored on the memory 510 and executable on the processor 520, wherein the processor 520 can implement the method according to any of the embodiments when reading the program from the memory 510 through a bus 530 and executing the program.
Processor 520 may process digital signals and may include various computing structures, such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture implementing a combination of instruction sets. In some examples, processor 520 may be a microprocessor.
Memory 510 may be used to store instructions that are executed by processor 520 or data related to the execution of the instructions. The instructions and/or data may include code for performing some or all of the functions of one or more of the modules described in embodiments of the application. The processor 520 of the disclosed embodiments may be used to execute instructions in the memory 510 to implement the method shown in fig. 2. Memory 510 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims (17)

1. A method of processing training data, the method comprising:
selecting a verification set from a data set, wherein the verification set is obtained by screening the data in the data set according to a first principle, and the first principle at least ensures that the label categories corresponding to the verification set are as close as possible to the label categories corresponding to the data set;
and using the data in the data set other than the verification set, together with their labels, as a training set.
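As a non-limiting illustration, the split of claim 1 might be sketched as follows. The representation of the data set as a list of `(item, labels)` pairs with set-valued labels, and the `val_fraction` and `seed` parameters, are assumptions made for this sketch and are not part of the claimed method:

```python
import random

def split_dataset(dataset, val_fraction=0.2, seed=0):
    """Select a verification set whose label categories cover, as far as
    possible, every label category in the full data set; the remaining
    items, with their labels, form the training set (claim 1 sketch)."""
    rng = random.Random(seed)
    target = max(1, int(len(dataset) * val_fraction))

    # First pass: pick one item per uncovered label so the verification
    # set's label categories match the data set's as closely as possible.
    all_labels = {lab for _, labs in dataset for lab in labs}
    validation, covered = [], set()
    for pair in dataset:
        if pair[1] - covered:          # contributes an uncovered label
            validation.append(pair)
            covered |= pair[1]
        if covered == all_labels:
            break

    # Second pass: top up with remaining items toward the target size.
    remaining = [p for p in dataset if p not in validation]
    rng.shuffle(remaining)
    while len(validation) < target and remaining:
        validation.append(remaining.pop())

    training = [p for p in dataset if p not in validation]
    return training, validation
```

Note that this sketch may overshoot the target size when many labels exist; the subsequent claims describe how such a set is trimmed or corrected.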
2. The method of claim 1, wherein the first principle further ensures that the difference between a first ratio and a second ratio is less than a target threshold,
wherein the first ratio is the ratio of the total number of data items included in the verification set to the total number of data items included in the data set, and the second ratio is the ratio of the total number of labels included in the verification set to the total number of labels included in the data set; or the first ratio is the ratio of the total number of data items included in the verification set to the total number of labels it includes, and the second ratio is the ratio of the total number of data items included in the data set to the total number of labels it includes.
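The first variant of claim 2's ratio check might be written as below. The `(item, labels)` pair representation and the default threshold of 0.05 are illustrative assumptions:

```python
def ratios_close(validation, dataset, threshold=0.05):
    """Claim 2 sketch (first variant): the data-count ratio between the
    verification set and the full data set should differ from the
    label-count ratio by less than the target threshold."""
    first_ratio = len(validation) / len(dataset)                     # data ratio
    label_count = lambda pairs: sum(len(labs) for _, labs in pairs)  # total labels
    second_ratio = label_count(validation) / label_count(dataset)    # label ratio
    return abs(first_ratio - second_ratio) < threshold
```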
3. The method of any one of claims 1-2, wherein the selecting a verification set from a data set comprises:
obtaining each distinct label included in the data set to obtain a first label set;
selecting, from the data set, at least one data item corresponding to each label in the first label set to obtain a basic verification data set;
and obtaining the verification set from the basic verification data set.
4. The method of claim 3, wherein the obtaining the verification set from the basic verification data set comprises:
if it is determined that the total amount of data in the basic verification data set is smaller than a target total data amount, adding part of the data in a first data set to the basic verification data set to obtain a first verification data set, wherein the first data set consists of the data included in the data set other than the basic verification data set;
and obtaining the verification set from the first verification data set.
5. The method of claim 3, wherein the obtaining the verification set from the basic verification data set comprises:
if it is determined that the total amount of data in the basic verification data set is larger than the target total data amount, screening the data in the basic verification data set to obtain a first verification data set;
and obtaining the verification set from the first verification data set.
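Claim 4's top-up step might look like the sketch below; the `(item, labels)` representation is an illustrative assumption, and the opposite (too-large) case is instead screened down as described in claims 5-6:

```python
def top_up_to_target(base, dataset, target_total):
    """Claim 4 sketch: when the basic verification data set is smaller
    than the target total data amount, take part of the data in the
    first data set (the leftover data) to reach the target."""
    first_data_set = [p for p in dataset if p not in base]  # leftover data
    shortfall = max(0, target_total - len(base))
    return base + first_data_set[:shortfall]
```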
6. The method of claim 5, wherein the screening the data in the basic verification data set comprises:
selecting at least m target data items from the basic verification data set as a second verification data set, wherein the total number of label categories corresponding to the m target data items is greater than or equal to the total number of label categories corresponding to any combination of m data items remaining in the basic verification data set;
and repeating the selection until the total number of data items in the second verification data set equals the target total data amount, so as to obtain the first verification data set.
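A greedy sketch of claim 6's selection follows. It exhaustively scores all m-combinations, which is only feasible for small sets; the default `m=1` and the data representation are assumptions for illustration:

```python
from itertools import combinations

def screen_down(base, target_total, m=1):
    """Claim 6 sketch: repeatedly move the m items whose combined label
    categories are largest from the basic verification data set into the
    second verification data set, until it reaches the target total."""
    pool = list(base)
    second = []
    while len(second) < target_total:
        best = max(
            combinations(range(len(pool)), m),
            key=lambda idx: len(set().union(*(pool[i][1] for i in idx))),
        )
        for i in sorted(best, reverse=True):
            second.append(pool.pop(i))
    return second  # serves as the first verification data set
```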
7. The method of any one of claims 4 to 5, wherein the obtaining the verification set from the first verification data set comprises:
if it is determined that a label ratio is greater than a first set threshold or less than a second set threshold, correcting part of the data in the first verification data set to obtain the verification set;
wherein the label ratio is the ratio between a third ratio and a fourth ratio, the third ratio is the ratio of the total number of labels included in the first verification data set to the total number of labels included in the data set, and the fourth ratio is the ratio of the total number of data items included in the first verification data set to the total number of data items included in the data set.
8. The method of claim 7, wherein correcting part of the data in the first verification data set if it is determined that the label ratio is less than the second set threshold comprises:
replacing first data in the first verification data set with second data, wherein the second data belongs to the data included in the data set other than the data in the first verification data set, the label categories of the second data include the label categories of the first data, and the total number of labels corresponding to the first data is less than the total number of labels corresponding to the second data;
and obtaining the verification set by performing this replacement one or more times until the label ratio of all data included in the replaced first verification data set is greater than or equal to the second set threshold.
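The replacement loop of claims 7-8 might be sketched as below. This sketch takes the first valid candidate rather than the label-richest one required by claim 9, and the helper names and data representation are assumptions for illustration:

```python
def raise_label_ratio(first_vds, dataset, second_threshold):
    """Claims 7-8 sketch: while the label ratio (third ratio over fourth
    ratio) is below the second set threshold, replace an item with one
    from the leftover data whose label categories include the item's and
    which carries more labels."""
    def label_ratio(vds):
        third = sum(len(l) for _, l in vds) / sum(len(l) for _, l in dataset)
        fourth = len(vds) / len(dataset)
        return third / fourth

    vds = list(first_vds)
    while label_ratio(vds) < second_threshold:
        leftover = [p for p in dataset if p not in vds]
        swapped = False
        for i, (item, labels) in enumerate(vds):
            for cand in leftover:
                # candidate covers the item's labels and carries more of them
                if labels <= cand[1] and len(cand[1]) > len(labels):
                    vds[i] = cand
                    swapped = True
                    break
            if swapped:
                break
        if not swapped:   # no valid replacement remains
            break
    return vds
```

Each swap strictly increases the verification set's label count while its size stays fixed, so the loop terminates.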
9. The method of claim 8, wherein the second data has the largest total number of labels among the remaining data included in the data set other than the data in the first verification data set.
10. The method of claim 7, wherein correcting part of the data in the first verification data set if it is determined that the label ratio is greater than the first set threshold comprises:
replacing third data in the first verification data set with fourth data, wherein the fourth data belongs to the data included in the data set other than the data in the first verification data set, the label categories of the third data include the label categories of the fourth data, and the number of labels corresponding to the third data is greater than the number of labels corresponding to the fourth data;
and obtaining the verification set by performing this replacement one or more times until the label ratio of all data included in the replaced first verification data set is less than or equal to the first set threshold.
11. The method of claim 10, wherein the fourth data has the smallest number of labels among the remaining data included in the data set other than the data in the first verification data set.
12. The method of claim 1, wherein before the selecting a verification set from a data set, the method further comprises:
obtaining the labels included in an original data set whose occurrence frequency is smaller than a set threshold, to obtain a rare label set;
resampling the data corresponding to the rare label set to obtain a sampling result;
and constructing the data set from the sampling result.
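Claim 12's rare-label resampling might be sketched with simple duplication oversampling; the duplication `factor`, the frequency threshold semantics, and the data representation are assumptions chosen for illustration, as the claim does not fix a resampling scheme:

```python
from collections import Counter
import random

def oversample_rare(original, freq_threshold, factor=2, seed=0):
    """Claim 12 sketch: find labels whose occurrence count is below the
    set threshold (the rare label set) and duplicate the data carrying
    them (factor - 1) extra times, then shuffle."""
    counts = Counter(lab for _, labs in original for lab in labs)
    rare = {lab for lab, n in counts.items() if n < freq_threshold}
    resampled = list(original)
    for item, labels in original:
        if labels & rare:   # item carries at least one rare label
            resampled.extend([(item, labels)] * (factor - 1))
    random.Random(seed).shuffle(resampled)
    return resampled, rare
```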
13. A method of training a natural language processing model, wherein the natural language processing model is trained and validated using the verification set and the training set obtained by the method of any one of claims 1-12, resulting in a target natural language processing model.
14. A method for processing text, wherein a classification result is obtained by classifying a file to be processed using the target natural language processing model obtained by the method of claim 13.
15. The method of claim 14, wherein after obtaining the classification result, the method further comprises: providing a retrieval result to a user or recommending a target file to a user according to the classification result.
16. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method of any one of claims 1 to 15.
17. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 15.
CN202211538320.3A 2022-12-01 2022-12-01 Method for processing training data, method for training model and medium Pending CN115827863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211538320.3A CN115827863A (en) 2022-12-01 2022-12-01 Method for processing training data, method for training model and medium


Publications (1)

Publication Number Publication Date
CN115827863A true CN115827863A (en) 2023-03-21

Family

ID=85544929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211538320.3A Pending CN115827863A (en) 2022-12-01 2022-12-01 Method for processing training data, method for training model and medium

Country Status (1)

Country Link
CN (1) CN115827863A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination