CN112434471A - Method, system, electronic device and storage medium for improving model generalization capability - Google Patents

Method, system, electronic device and storage medium for improving model generalization capability

Info

Publication number
CN112434471A
Authority
CN
China
Prior art keywords
model
adversarial validation
training
training set
verification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011437875.XA
Other languages
Chinese (zh)
Inventor
王璋琪
段少毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Enyike Beijing Data Technology Co ltd
Original Assignee
Enyike Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Enyike Beijing Data Technology Co ltd filed Critical Enyike Beijing Data Technology Co ltd
Priority to CN202011437875.XA priority Critical patent/CN112434471A/en
Publication of CN112434471A publication Critical patent/CN112434471A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00: Computer-aided design [CAD]
    • G06F 30/20: Design optimisation, verification or simulation
    • G06F 30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques

Abstract

The application discloses a method, a system, an electronic device, and a storage medium for improving model generalization capability. The method comprises the following steps. Adversarial validation set acquisition: label the original training set and the original test set separately, merge them into a new training set, train a model on the new training set, run prediction, and obtain an adversarial validation set from the prediction results. Model acquisition: train a model using the actual training set and the adversarial validation set, and select the final model according to its results on the adversarial validation set. The invention further provides a one-stop machine learning platform based on PHM data modeling, which starts from actual business, offers users solutions for vertical-domain scenarios, and improves their development efficiency.

Description

Method, system, electronic device and storage medium for improving model generalization capability
Technical Field
The present application relates to the field of data modeling, and in particular, to a method, a system, an electronic device, and a storage medium for improving model generalization capability.
Background
With the rapid development of Internet technology, more and more companies can collect large amounts of data and build models from it, by means of machine learning or deep learning methods, to guide their business. The key to fully exploiting such data is building models that predict well on data they have never seen. Typically, a company's technicians collect a set of labeled samples and build a model on them in order to label the unlabeled samples. Taking a gender prediction model as an example: samples with known gender labels are collected, and the model is trained on this training set; samples of unknown gender (referred to as the test set) can then be labeled using the model's predictions. In practice, however, the model often generalizes poorly at prediction time, because the feature distributions of the training set and the test set we obtain are frequently inconsistent, and the model easily overfits during training, leading to poor generalization. The usual remedy is to randomly hold out part of the training set as a validation set and adopt the model that performs best on it, so as to avoid overfitting. Although this approach avoids overfitting to some extent, it can hardly solve the poor generalization caused by the mismatch between the training-set and test-set feature distributions.
Therefore, in view of the above situation, a method, a system, an electronic device, and a storage medium for improving model generalization capability have been devised. Generalization is effectively improved by introducing adversarial validation: the training set and the test set are given different labels, a 5-fold cross-validated binary classification model is built to distinguish them, the training set is scored with this classifier, and the training samples predicted as test-set samples are set aside as the adversarial validation set. The model is then trained on the remaining training samples, and the model performing best on the adversarial validation set is retained. By constructing the adversarial validation set, the feature distribution of the validation set is kept as consistent as possible with that of the test set, so that the model performing best on the validation set also performs best on the test set; overfitting is thereby effectively avoided and generalization capability is effectively improved.
Disclosure of Invention
The embodiments of the present application provide a method, a system, an electronic device, and a storage medium for improving model generalization capability, so as to at least solve the problem of poor model generalization in the related art.
The invention provides a method for improving model generalization capability, based on a screened validation set, comprising the following steps:
an adversarial validation set acquisition step: labeling an original training set and an original test set respectively, merging them to build a new training set, training a model on the new training set, predicting, and then obtaining an adversarial validation set from the prediction results;
a model acquisition step: training a model using an actual training set and the adversarial validation set, and obtaining a final model according to the model's results on the adversarial validation set.
In the method for improving model generalization capability, the adversarial validation set acquisition step comprises:
a new training set construction step: labeling the original training set and the original test set with 0 and 1 respectively, and merging them to construct a new training set;
an adversarial validation set selection step: training the model on the new training set and predicting labels for it, and selecting the training-set samples predicted as test-set samples as the adversarial validation set.
In the method for improving model generalization capability, the model acquisition step comprises training the model using the actual training set and the adversarial validation set, and obtaining the final model according to the model's results on the adversarial validation set.
The invention also provides a system for improving model generalization capability, to which the above method for improving model generalization capability based on a screened validation set is applicable, comprising:
an adversarial validation set acquisition unit: labeling an original training set and an original test set respectively, merging them to build a new training set, training a model on the new training set, predicting, and obtaining an adversarial validation set from the prediction results;
a model acquisition unit: training a model using an actual training set and the adversarial validation set, and obtaining a final model according to the model's results on the adversarial validation set.
In the system for improving model generalization capability, the adversarial validation set acquisition unit comprises:
a new training set construction module: labeling the original training set and the original test set with 0 and 1 respectively, and merging them to construct a new training set;
an adversarial validation set selection module: training the model on the new training set and predicting labels for it, and selecting the training-set samples predicted as test-set samples as the adversarial validation set.
In the system for improving model generalization capability, the model acquisition unit is configured to train the model using the actual training set and the adversarial validation set, and to obtain the final model according to the model's results on the adversarial validation set.
The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements any one of the above methods for improving model generalization capability.
The invention further provides a computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement any one of the above methods for improving model generalization capability.
Compared with the related art, the method, system, electronic device, and storage medium for improving model generalization capability provided by the embodiments of the present application improve generalization by introducing adversarial validation: different labels are applied to the training set and the test set, a 5-fold cross-validated binary classification model is built to distinguish them, the training set is scored with this classifier, the training samples predicted as test-set samples are set aside as the adversarial validation set, the model is then trained on the remaining training samples, and the model performing best on the adversarial validation set is retained. By constructing the adversarial validation set, the feature distribution of the validation set is kept as consistent as possible with that of the test set, so that the model performing best on the validation set also performs best on the test set; overfitting is effectively avoided and generalization capability is effectively improved.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a block diagram of a method for improving model generalization capability according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for improving model generalization capability according to an embodiment of the present application;
FIG. 3 is a flowchart of the adversarial validation set acquisition step according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a system for improving model generalization capability according to the present invention;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Wherein the reference numerals are:
21: adversarial validation set acquisition unit;
22: model acquisition unit;
211: new training set construction module;
212: adversarial validation set selection module;
81: processor;
82: memory;
83: communication interface;
80: bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. References to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof in this application are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. References to "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "And/or" describes an association relationship between associated objects, covering three cases; for example, "A and/or B" may mean: A alone, A and B together, or B alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.
The present invention is based on screening a validation set, which is briefly described below.
Data modeling refers to abstractly organizing data from the real world: determining the scope of the database, the organizational form of the data, and so on, until the design is converted into a real database. After the conceptual model produced by system analysis is converted into a physical model, database entities and the relationships among them (entities are generally tables) are established in tools such as Visio or ERwin. In software engineering, data modeling is the process of building a data model of an information system using formal data modeling techniques: it defines and analyzes the data requirements of the information system and the support they need. In the course of data modeling, the professional modeling work involved is therefore closely tied to the interests of the enterprise and of the users of the information system. From requirements to an actual database, three different model types are involved. The conceptual data model of an information system is essentially the first normalized statement of a set of recorded data requirements: the data is first used to discuss the initial requirements of the enterprise, and is then transformed into a logical data model, a conceptual model of the data structure that can be implemented in a database. The implementation of one conceptual data model may require multiple logical data models. The last step of data modeling is to refine the logical data model into a physical data model that meets specific requirements for data access performance and storage. Data modeling defines not only data elements, but also their structure and the relationships between them.
The main activities in the modeling process include: determining data and its associated processes (for example, sales personnel need to view the online product catalog and submit new customer orders); defining the data (e.g., data type, size, and default values); ensuring data integrity (using business rules and validation checks); defining operational processes (such as security checks and backups); and selecting a data storage technique (e.g., relational, hierarchical, or indexed storage). It must be understood that modeling will often involve a company's management in unexpected ways. For example, data ownership (and the implicit responsibility for data maintenance, accuracy, and timeliness) is often called into question when there is new insight into which data elements should be maintained by which organizations. Data design often forces companies to recognize how their enterprise data systems are interdependent, and encourages them to seize the efficiency improvements, cost savings, and strategic opportunities that coordinated data planning brings. At the end of modeling, the requirements of the application are completely defined, the data and services reusable by other enterprise-level applications are identified, and a strong foundation is laid for future expansion.
In machine learning, samples are typically divided into three independent parts: a training set (train set), a validation set (validation set), and a test set (test set). The training set is used to fit the model, the validation set is used to choose the network structure or the parameters that control model complexity, and the test set checks how well the finally selected optimal model performs. One typical division is 50% of the samples for the training set and 25% each for the other two, all three drawn randomly from the samples. When the total number of samples is small, however, this division is not appropriate; it is then common to hold out a small portion as the test set and apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K parts, train on K-1 parts and validate on the remaining part in turn, compute the sum of squared prediction errors each time, and average the K sums as the basis for selecting the optimal model structure. The special case is K = N (leave-one-out). Generalization ability refers to a machine learning algorithm's capacity to adapt to fresh samples. The purpose of learning is to uncover the rules hidden behind the data, so that for data outside the learning set governed by the same rules, the trained network can still give appropriate outputs; this capability is called generalization. A network trained on the training samples is generally expected to have strong generalization ability, i.e., the ability to give reasonable responses to new inputs. Note that more training iterations do not necessarily yield a more correct input-output mapping.
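The K-fold cross-validation procedure described above can be sketched as follows (an illustrative sketch, not part of the original disclosure; the function name and the pluggable `train_and_score` callback are our own):

```python
import random

def k_fold_cv(samples, k, train_and_score):
    """Shuffle the samples, split them evenly into k folds, train on k-1
    folds and validate on the held-out fold in turn, then average the k
    validation errors as the model-selection criterion."""
    data = samples[:]                       # copy so the caller's list is untouched
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    errors = []
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(train_and_score(train, val))
    return sum(errors) / k                  # average validation error
```

Passing k equal to the number of samples gives the leave-one-out special case mentioned above.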
The performance of a network is measured primarily by its generalization ability, which can degrade for several reasons. The selected sample data may be insufficient to represent the intended classification rule, for example because there are too few samples, the sampling method is flawed, or the samples are mislabeled; the noise in the samples may be so large that the machine treats part of the noise as features, disturbing the intended classification rule; the assumed model may not reasonably exist, or its assumptions may not actually hold; there may be too many parameters and too high a model complexity. For a decision tree model, if its growth is not reasonably limited, it may grow freely until each node contains only simple event data or non-event data, perfectly fitting the training data but failing to adapt to other data sets. For a neural network model: (a) the classification decision surface for the sample data may not be unique, and as learning proceeds the BP algorithm may make the weights converge on an overly complex decision surface; (b) the weights may be trained for too many iterations (overtraining), fitting the noise in the training data and unrepresentative features in the training examples. Overfitting means making the hypothesis overly strict in order to obtain a consistent hypothesis; avoiding overfitting is a central task in classifier design, and classifier performance is typically evaluated by increasing the amount of data and using a test sample set. Formally, given a hypothesis space H and a hypothesis h in H, if there exists another hypothesis h' in H such that h has a lower error rate than h' on the training examples, but h' has a lower error rate than h over the entire distribution of instances, then h is said to overfit the training data.
A hypothesis is considered to overfit when it achieves a better fit on the training data than some alternative hypothesis but fits data outside the training set worse. The main causes are noise in the training data or too little training data.
Existing methods for addressing model overfitting and improving generalization generally split the original training set randomly into a validation set and a training set, and use the validation set to judge whether the model overfits, thereby improving the model. Although randomly splitting the training set can mitigate overfitting to some extent, it can hardly improve generalization: since the validation set used to detect overfitting is randomly drawn from the training set, there is no guarantee that a model trained on the training-set features suits the test-set features. The feature distribution of the validation set matches that of the training set but is not guaranteed to match that of the test set, so the best model selected on the validation set is not guaranteed to perform best on the test set.
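For contrast, the conventional random hold-out just described can be sketched as follows (illustrative only; the names are ours, not the patent's). Because the validation set is drawn at random from the training set, its feature distribution matches the training set rather than the test set:

```python
import random

def random_split(train_samples, val_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the training set as a validation set."""
    data = train_samples[:]
    random.Random(seed).shuffle(data)
    n_val = int(len(data) * val_fraction)
    # (remaining training set, randomly drawn validation set)
    return data[n_val:], data[:n_val]
```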
The method, system, electronic device, and storage medium for improving model generalization capability improve generalization by introducing adversarial validation: different labels are applied to the training set and the test set, a 5-fold cross-validated binary classification model is built to distinguish them, the training set is scored with this classifier, the training samples predicted as test-set samples are set aside as the adversarial validation set, the model is then trained on the remaining training samples, and the model performing best on the adversarial validation set is retained. By constructing the adversarial validation set, the feature distribution of the validation set is kept as consistent as possible with that of the test set, so that the model performing best on the validation set also performs best on the test set; overfitting is effectively avoided and generalization capability is effectively improved.
The following describes the embodiments of the present application, taking a screened validation set as an example.
Example one
This embodiment provides a method for improving model generalization capability. Referring to FIGS. 1-3: FIG. 1 is a block diagram of a method for improving model generalization capability according to an embodiment of the present application; FIG. 2 is a flowchart of the method; FIG. 3 is a flowchart of the adversarial validation set acquisition step. As shown in FIGS. 1 to 3, the method for improving model generalization capability comprises the following steps:
adversarial validation set acquisition step S1: label an original training set and an original test set respectively, merge them to build a new training set, train a model on the new training set, predict, and then obtain an adversarial validation set from the prediction results;
model acquisition step S2: train a model using an actual training set and the adversarial validation set, and obtain a final model according to the model's results on the adversarial validation set.
In an embodiment, the adversarial validation set acquisition step S1 comprises:
a new training set construction step S11: label the original training set and the original test set with 0 and 1 respectively, and merge them to construct a new training set;
an adversarial validation set selection step S12: train the model on the new training set and predict labels for it, and select the training-set samples predicted as test-set samples as the adversarial validation set.
Specifically, the adversarial validation set acquisition step S1 comprises labeling the original training set and the original test set with 0 and 1 respectively, and then merging them to construct a training set from which the adversarial validation samples are selected. A model is then trained with 5-fold cross-validation and used to predict labels for all training-set samples; the samples predicted as label 1 (i.e., as test set) with the highest 10% of probabilities are selected as the validation set, namely the adversarial validation set, and the remaining training-set samples serve as the actual training set.
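The step S1 just described can be sketched as follows (an illustrative sketch, not the patented implementation; the `fit_predict_proba` callback stands in for whatever binary classifier is used, and all names are ours). Training rows are labeled 0 and test rows 1, each row receives an out-of-fold probability of being test-like from 5-fold cross-validation, and the top 10% most test-like training rows become the adversarial validation set:

```python
import random

def adversarial_validation_split(train_rows, test_rows, fit_predict_proba,
                                 n_folds=5, top_fraction=0.10, seed=0):
    """Label training rows 0 and test rows 1, score every row with an
    out-of-fold probability of being a test row, and set aside the most
    test-like fraction of training rows as the adversarial validation set."""
    rows = list(train_rows) + list(test_rows)
    labels = [0] * len(train_rows) + [1] * len(test_rows)

    # Shuffle indices and split them into n_folds disjoint folds.
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]

    # Out-of-fold P(label == 1) for every row.
    scores = [0.0] * len(rows)
    for i in range(n_folds):
        held = folds[i]
        fit_idx = [k for j, fold in enumerate(folds) if j != i for k in fold]
        fit_data = [(rows[k], labels[k]) for k in fit_idx]
        proba = fit_predict_proba(fit_data, [rows[k] for k in held])
        for k, p in zip(held, proba):
            scores[k] = p

    # Rank original training rows by how test-like they look.
    ranked = sorted(range(len(train_rows)), key=lambda k: scores[k], reverse=True)
    n_val = max(1, int(len(train_rows) * top_fraction))
    adversarial_val = [rows[k] for k in ranked[:n_val]]
    actual_train = [rows[k] for k in ranked[n_val:]]
    return actual_train, adversarial_val
```

Any classifier exposing predicted probabilities (e.g., logistic regression or gradient-boosted trees) could serve as `fit_predict_proba`.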
In an embodiment, the model acquisition step S2 comprises training a model using the actual training set and the adversarial validation set, and obtaining the final model according to the model's results on the adversarial validation set.
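Step S2 can likewise be sketched (illustrative only; the names and the `train_fn`/`score_fn` callbacks are our own): each candidate model is trained on the actual training set, and the candidate scoring best on the adversarial validation set is kept as the final model.

```python
def select_final_model(candidates, actual_train, adv_val, train_fn, score_fn):
    """Train each candidate on the actual training set and keep whichever
    scores best on the adversarial validation set."""
    best_model, best_score = None, float("-inf")
    for cand in candidates:
        model = train_fn(cand, actual_train)  # fit this candidate
        score = score_fn(model, adv_val)      # evaluate on the adversarial validation set
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```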
Therefore, the method, system, electronic device, and storage medium for improving model generalization capability provided by the embodiments of the present invention improve generalization by introducing adversarial validation: different labels are applied to the training set and the test set, a 5-fold cross-validated binary classification model is built to distinguish them, the training set is scored with this classifier, the training samples predicted as test-set samples are set aside as the adversarial validation set, the model is then trained on the remaining training samples, and the model performing best on the adversarial validation set is retained. By constructing the adversarial validation set, the feature distribution of the validation set is kept as consistent as possible with that of the test set, so that the model performing best on the validation set also performs best on the test set; overfitting is effectively avoided and generalization capability is effectively improved.
Example two
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of the system for improving model generalization capability according to the present invention. As shown in FIG. 4, the above method for improving model generalization capability is applicable to this system, which comprises:
the adversarial validation set acquisition unit 21: labels an original training set and an original test set respectively, merges them to build a new training set, trains a model on the new training set, predicts, and obtains an adversarial validation set from the prediction results;
the model acquisition unit 22: trains a model using an actual training set and the adversarial validation set, and obtains a final model according to the model's results on the adversarial validation set.
In this embodiment, the adversarial validation set acquisition unit 21 comprises:
the new training set construction module 211: labels the original training set and the original test set with 0 and 1 respectively, and merges them to construct a new training set;
the adversarial validation set selection module 212: trains the model on the new training set and predicts labels for it, and selects the training-set samples predicted as test-set samples as the adversarial validation set.
EXAMPLE III
Referring to fig. 5, this embodiment discloses a specific implementation of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
Specifically, the processor 81 may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 82 may include mass storage for data or instructions. By way of example, and not limitation, the memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 82 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus. In a particular embodiment, the memory 82 is a non-volatile memory. In particular embodiments, the memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Out DRAM (EDO DRAM), Synchronous DRAM (SDRAM), and the like.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements any of the methods for improving model generalization capability in the above embodiments by reading and executing the computer program instructions stored in the memory 82.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 5, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used for implementing communication between the modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also implement data communication with external components such as external devices, image/data acquisition equipment, databases, external storage, image/data processing workstations, and the like.
The bus 80 includes hardware, software, or both, coupling the components of the computer device to each other. The bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, the bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. The bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the present application, any suitable buses or interconnects are contemplated by the present application.
The electronic device may be connected to a system for improving model generalization capability, thereby implementing the methods described in connection with fig. 1 to 3.
In addition, in combination with the methods for improving model generalization capability in the foregoing embodiments, the embodiments of the present application may provide a readable storage medium for implementation by an electronic device. The readable storage medium has computer program instructions stored thereon; when executed by a processor, the computer program instructions implement any of the methods for improving model generalization capability in the above embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. A method for improving model generalization capability, characterized in that the method is based on a screened verification set and comprises the following steps:
a confrontation verification set obtaining step: labeling an original training set and an original test set respectively, merging them to establish a new training set, training a model on the new training set, predicting with it, and obtaining a confrontation verification set from the prediction results;
a model obtaining step: training a model using an actual training set and the confrontation verification set, and obtaining a final model according to the model's results on the confrontation verification set.
2. The method for improving model generalization capability according to claim 1, wherein the confrontation verification set obtaining step comprises:
a new training set construction step: labeling the original training set and the original test set with labels 0 and 1 respectively, and merging them to construct the new training set;
a confrontation verification set selection step: training a model on the new training set, predicting labels for its samples, and selecting the original training-set samples that are predicted as test-set samples as the confrontation verification set.
3. The method for improving model generalization capability according to claim 1, wherein the model obtaining step comprises training a model using the actual training set and the confrontation verification set, and obtaining a final model according to the model's results on the confrontation verification set.
4. A system for improving model generalization capability, adapted to the method for improving model generalization capability based on a screened verification set according to any one of claims 1 to 3, the system comprising:
a confrontation verification set acquisition unit: configured to label an original training set and an original test set respectively, merge them to establish a new training set, train a model on the new training set, predict with it, and obtain a confrontation verification set from the prediction results;
a model acquisition unit: configured to train a model using an actual training set and the confrontation verification set, and obtain a final model according to the model's results on the confrontation verification set.
5. The system for improving model generalization capability according to claim 4, wherein the confrontation verification set acquisition unit comprises:
a new training set construction module: configured to label the original training set and the original test set with labels 0 and 1 respectively, and merge them to construct the new training set;
a confrontation verification set selection module: configured to train a model on the new training set, predict labels for its samples, and select the original training-set samples that are predicted as test-set samples as the confrontation verification set.
6. The system for improving model generalization capability according to claim 5, wherein the model acquisition unit is configured to train a model using the actual training set and the confrontation verification set, and obtain the final model according to the model's results on the confrontation verification set.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for improving model generalization capability according to any one of claims 1 to 3.
8. An electronic device readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method for improving model generalization capability according to any one of claims 1 to 3.
CN202011437875.XA 2020-12-11 2020-12-11 Method, system, electronic device and storage medium for improving model generalization capability Pending CN112434471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011437875.XA CN112434471A (en) 2020-12-11 2020-12-11 Method, system, electronic device and storage medium for improving model generalization capability

Publications (1)

Publication Number Publication Date
CN112434471A true CN112434471A (en) 2021-03-02

Family

ID=74692451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011437875.XA Pending CN112434471A (en) 2020-12-11 2020-12-11 Method, system, electronic device and storage medium for improving model generalization capability

Country Status (1)

Country Link
CN (1) CN112434471A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235813B1 (en) * 2013-06-29 2016-01-12 Emc Corporation General framework for cross-validation of machine learning algorithms using SQL on distributed systems
US20180165554A1 (en) * 2016-12-09 2018-06-14 The Research Foundation For The State University Of New York Semisupervised autoencoder for sentiment analysis
CN110288042A (en) * 2019-07-01 2019-09-27 山东浪潮人工智能研究院有限公司 A kind of single model fusion method based on cross validation
CN110298264A (en) * 2019-06-10 2019-10-01 上海师范大学 Based on the human body daily behavior activity recognition optimization method for stacking noise reduction self-encoding encoder
CN110717602A (en) * 2019-09-29 2020-01-21 南京大学 Machine learning model robustness assessment method based on noise data
CN110750645A (en) * 2019-10-15 2020-02-04 广东外语外贸大学 Cross-domain false comment identification method based on countermeasure training
CN111524015A (en) * 2020-04-10 2020-08-11 易方达基金管理有限公司 Method and device for training prediction model, computer equipment and readable storage medium
AU2020101854A4 (en) * 2020-08-17 2020-09-24 China Communications Construction Co., Ltd. A method for predicting concrete durability based on data mining and artificial intelligence algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张运陶; 高世博: "Particle swarm optimization of the training set for chemical engineering modeling", 化工学报 (CIESC Journal), no. 04, pages 964-969 *
王新文 et al.: "Dual residual network recognition method for abnormal fall behavior", 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), vol. 14, no. 9, pages 1580-1589 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949214A (en) * 2021-04-25 2021-06-11 顶象科技有限公司 Machine learning modeling method, visual modeling platform and electronic equipment
CN112949214B (en) * 2021-04-25 2023-07-21 顶象科技有限公司 Machine learning modeling method, visual modeling platform and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination