CN112052915B - Data training method, device, equipment and storage medium


Info

Publication number
CN112052915B
Authority
CN
China
Prior art keywords
sample data, data, negative sample, positive sample, positive
Legal status
Active
Application number
CN202011055438.1A
Other languages
Chinese (zh)
Other versions
CN112052915A
Inventor
万明霞
Current Assignee
Bank of China Ltd
Original Assignee
Bank of China Ltd
Application filed by Bank of China Ltd
Priority to CN202011055438.1A
Publication of CN112052915A
Application granted
Publication of CN112052915B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/20: Ensemble learning


Abstract

The application provides a data training method, apparatus, device, and storage medium. Sample data are acquired from an original training data set and preprocessed to obtain positive sample data and negative sample data. For the positive and negative sample data respectively, all column features contained therein are traversed; each column feature is randomly shuffled and the shuffled columns are recombined to obtain new positive and negative sample data, which are added to the original training data set to form a new training data set used for model training. By randomly shuffling and recombining the features within each sample, the N features become mutually independent while each still follows its normal distribution. On this basis, data enhancement can be applied to non-image, non-speech data, effectively enlarging the data set; when such data are used for training, model overfitting is effectively alleviated and the accuracy of model prediction is improved.

Description

Data training method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data training method, apparatus, device, and storage medium.
Background
At present, when a model is trained with training sample data of small size, overfitting easily occurs: the model becomes excessively dependent on the training sample data, which adversely affects the accuracy of the model's prediction results.
For image data and voice data, data enhancement means such as flipping, rotation, and Gaussian noise are generally adopted to amplify the scale of the training samples, thereby alleviating overfitting during model training and improving the accuracy of the prediction results. For non-image and non-speech data, however, the sample data size cannot be amplified by these data enhancement means, so model training with such data remains prone to overfitting and inaccurate prediction results.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a data training method, apparatus, device, and storage medium, so that the sample data size can be amplified by data enhancement when training a model with non-image, non-speech data, thereby alleviating model overfitting and improving the accuracy of model prediction.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
in one aspect, an embodiment of the present invention provides a data training method, where the method includes:
acquiring sample data in an original training data set;
preprocessing the sample data to obtain positive sample data and negative sample data;
traversing, for the positive sample data and the negative sample data respectively, all column features contained therein;
randomly shuffling each of the column features contained in the positive sample data and the negative sample data respectively, and recombining the shuffled columns to obtain new positive sample data and new negative sample data;
adding the new positive sample data and the new negative sample data to the original training data set to obtain a new training data set;
and performing model training by using the new training data set.
Optionally, traversing, for the positive sample data and the negative sample data respectively, all column features contained therein includes:
traversing, for a first preset proportion of the positive sample data and a second preset proportion of the negative sample data respectively, all column features contained therein;
wherein the first preset proportion indicates the ratio of the number of positive sample data used for traversal to the total number of positive sample data, and the second preset proportion indicates the ratio of the number of negative sample data used for traversal to the total number of negative sample data.
Optionally, traversing, for the positive sample data and the negative sample data respectively, all column features contained therein includes:
traversing, for positive sample data and negative sample data whose numbers satisfy a third preset proportion, all column features contained therein;
wherein the third preset proportion indicates the ratio between the number of positive sample data used for traversal and the number of negative sample data used for traversal.
Optionally, traversing, for the positive sample data and the negative sample data respectively, all column features contained therein includes:
traversing, for all positive sample data and all negative sample data respectively, all column features contained therein.
In another aspect, an embodiment of the present invention provides a data training apparatus, including:
the acquisition module is used for acquiring sample data in the original training data set;
the preprocessing module is used for preprocessing the sample data to obtain positive sample data and negative sample data;
a traversal feature module, configured to traverse, for the positive sample data and the negative sample data respectively, all column features contained therein;
a processing module, configured to randomly shuffle each of the column features contained in the positive sample data and the negative sample data respectively, and to recombine the shuffled columns to obtain new positive sample data and new negative sample data;
an adding module, configured to add the new positive sample data and the new negative sample data to the original training data set to obtain a new training data set;
and the training module is used for carrying out model training by utilizing the new training data set.
Optionally, the traversal feature module is specifically configured to traverse, for a first preset proportion of the positive sample data and a second preset proportion of the negative sample data respectively, all column features contained therein;
wherein the first preset proportion indicates the ratio of the number of positive sample data used for traversal to the total number of positive sample data, and the second preset proportion indicates the ratio of the number of negative sample data used for traversal to the total number of negative sample data.
Optionally, the traversal feature module is specifically configured to traverse, for positive sample data and negative sample data whose numbers satisfy a third preset proportion, all column features contained therein;
wherein the third preset proportion indicates the ratio between the number of positive sample data used for traversal and the number of negative sample data used for traversal.
Optionally, the traversal feature module is specifically configured to traverse, for all positive sample data and all negative sample data respectively, all column features contained therein.
In another aspect, an embodiment of the present invention provides a data training device, including a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to implement the method when invoking and executing the computer program stored in the memory.
In another aspect, embodiments of the present invention provide a storage medium having stored therein computer-executable instructions that, when loaded and executed by a processor, implement the method.
Based on the data training method, apparatus, device, and storage medium provided by the embodiments of the present invention, sample data are acquired from an original training data set; the sample data are preprocessed to obtain positive sample data and negative sample data; for the positive and negative sample data respectively, all column features contained therein are traversed; each of the column features is randomly shuffled and the shuffled columns are recombined to obtain new positive sample data and new negative sample data; the new positive and negative sample data are added to the original training data set to obtain a new training data set; and model training is performed using the new training data set. In the scheme provided by the embodiments of the present invention, randomly shuffling and recombining the features within each sample makes the N features mutually independent while each still follows its normal distribution. On this basis, data enhancement can be applied to non-image, non-speech data, effectively enlarging the data set; when such data are used for training, model overfitting is effectively alleviated and the accuracy of model prediction is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and that a person of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a data training method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a data training apparatus according to an embodiment of the present invention;
FIG. 3 is a block diagram of a data training device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
As described in the Background, during model training with non-image and non-speech data the sample data size cannot be amplified by data enhancement means, so model training with such data is prone to overfitting and inaccurate prediction results.
Therefore, embodiments of the present invention provide a data training method, apparatus, device, and storage medium, so that in the process of model training with non-image and non-speech data, the sample data scale can be amplified by data enhancement, model overfitting can be alleviated, and the accuracy of model prediction can be improved.
Referring to fig. 1, a flow chart of a data training method according to an embodiment of the present invention is shown. The method comprises the following steps:
s101: sample data in the original training data set is obtained.
When implementing S101, all sample data in the original training data set may be acquired, or only a portion of the sample data in the original training data set may be acquired.
S102: and preprocessing the sample data to obtain positive sample data and negative sample data.
In the process of implementing S102, the following preprocessing may be performed based on the sample data obtained by executing S101:
first, the sample data obtained in S101 is subjected to screening processing, and abnormal data in the sample data is removed.
Secondly, carrying out standardization processing on the sample data, scaling the attribute of the sample data to a certain appointed range, converting the sample data into data with zero mean and one variance, and enabling each feature in the sample data to be subjected to Gaussian normal distribution.
And finally, carrying out feature coding processing on the sample data, converting the numerical attribute in the sample data into the attribute of the Boolean value, and setting a threshold value as a separation point for dividing the attribute value into 0 and 1. Alternatively, in the implementation process, sample data with an attribute value of 1 may be referred to as positive sample data, and sample data with an attribute value of 0 may be referred to as negative sample data.
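To make the preprocessing concrete, here is a minimal Python sketch (Python is named later in this description as the implementation language for the shuffling step). The column layout, the 3-sigma screening rule, the use of scikit-learn's StandardScaler, and the 0.5 threshold are assumptions made for illustration, not details fixed by this embodiment.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, label_col: str = "label", threshold: float = 0.5):
    """Screen, standardize, and binarize sample data (illustrative assumptions)."""
    features = df.drop(columns=[label_col])

    # 1. Screening: remove abnormal rows, here assumed to be rows where any
    #    feature lies more than 3 standard deviations from its column mean.
    z = (features - features.mean()) / features.std(ddof=0)
    df = df.loc[(z.abs() <= 3).all(axis=1)]

    # 2. Standardization: scale every feature to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(df.drop(columns=[label_col]))
    X = pd.DataFrame(scaled, columns=features.columns, index=df.index)

    # 3. Feature coding: threshold the label into Boolean 0/1 and split the
    #    rows into positive (1) and negative (0) sample data.
    y = (df[label_col] > threshold).astype(int)
    return X[y == 1], X[y == 0]
```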
S103: all column features contained in the positive and negative sample data are traversed for the positive and negative sample data, respectively.
In the process of implementing S103 in particular, there are a variety of implementations.
Optionally, the first scheme is: traversing all column features contained in the positive sample data of the first preset proportion and the negative sample data of the second preset proportion respectively for the positive sample data of the first preset proportion and the negative sample data of the second preset proportion.
The first preset proportion indicates the proportion of the number of positive sample data used for traversing to the number of all positive sample data, and the second preset proportion indicates the proportion of the number of negative sample data used for traversing to the number of all negative sample data.
It should be noted that the first preset proportion and the second preset proportion may take the same value or may take different values. Of course, the first preset ratio may be a value greater than the second preset ratio, or may be a value less than the second preset ratio, which is not limited herein.
The second scheme is as follows: and traversing all column features contained in the positive sample data and the negative sample data which meet the third preset proportional relation according to the positive sample data and the negative sample data which meet the third preset proportional relation respectively.
Wherein the third preset proportion indicates a proportion between the number of positive sample data for traversal and the number of negative sample data for traversal.
The third preset proportion may be a proportion value obtained by dividing the number of positive sample data used for traversal by the number of negative sample data used for traversal, or may be a proportion value obtained by dividing the number of negative sample data used for traversal by the number of positive sample data used for traversal.
The third scheme is as follows: all positive and negative sample data are traversed for all column features contained therein, respectively.
It should be further noted that, in the above three schemes, a specific scheme may be selected according to the actual scene application requirement, and in the implementation, the preset proportion may also be set according to the actual scene application requirement, for example, when the total positive sample data is smaller than the total negative sample data, the third preset proportion obtained by dividing the number of positive sample data used for traversal by the number of negative sample data used for traversal may be set to a larger value, which is, of course, only introduced by way of example.
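As an illustration of the first scheme above, the subset selection might look as follows in Python; the use of pandas.DataFrame.sample, the example ratio values, and the fixed random seed are assumptions of this sketch.

```python
import pandas as pd

def select_for_traversal(positive: pd.DataFrame, negative: pd.DataFrame,
                         first_ratio: float = 0.8, second_ratio: float = 0.6,
                         seed: int = 42):
    """Draw the preset proportions of positive and negative samples whose
    column features will be traversed and shuffled (scheme one)."""
    pos_subset = positive.sample(frac=first_ratio, random_state=seed)
    neg_subset = negative.sample(frac=second_ratio, random_state=seed)
    return pos_subset, neg_subset
```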
S104: and randomly scrambling each column of features in all columns of features for all columns of features contained in the positive sample data and the negative sample data respectively, and recombining to obtain new positive sample data and new negative sample data.
In the specific implementation S104, each column of features of all columns of features included in the positive sample data is randomly scrambled and recombined to obtain new positive sample data, and each column of features of all columns of features included in the negative sample data is randomly scrambled and recombined to obtain new negative sample data.
In a specific implementation, the random shuffling may be performed with Python's shuffle function (for example, random.shuffle or numpy.random.shuffle), although other approaches may also be used.
It should be noted that during the random shuffling, the features of the current column being shuffled are permuted among themselves, and the shuffled features remain in the current column.
To facilitate understanding of the random shuffling described above, an example follows; it is intended to be illustrative only.
For example, suppose the positive sample data contains 3 columns of features. The first, second, and third columns are each randomly shuffled, either at the same time or one after another (for example, in the order first column, second column, third column). The shuffled features of the first column remain in the first column, those of the second column remain in the second column, and those of the third column remain in the third column.
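A minimal sketch of this column-wise shuffling, assuming the sample data are held in a pandas DataFrame and using numpy's permutation in place of the shuffle function mentioned above (both choices are assumptions of the sketch; the embodiment only requires that each column be permuted within itself):

```python
import numpy as np
import pandas as pd

def shuffle_columns(samples: pd.DataFrame, seed=None) -> pd.DataFrame:
    """Build new samples by independently permuting each column.

    Each column's values are reordered among themselves and stay in that
    column, so the recombined rows mix features across original samples.
    """
    rng = np.random.default_rng(seed)
    shuffled = samples.copy()
    for col in samples.columns:
        # Permute this column only; the permuted values remain in this column.
        shuffled[col] = rng.permutation(samples[col].to_numpy())
    return shuffled.reset_index(drop=True)

# The 3-column example above:
pos = pd.DataFrame({"f1": [1, 2, 3], "f2": [4, 5, 6], "f3": [7, 8, 9]})
new_pos = shuffle_columns(pos, seed=0)  # each f-column permuted independently
```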
It should be noted that randomly shuffling each column feature contained in the sample data makes the features contained in the shuffled sample data mutually independent, which facilitates the subsequent amplification of the sample data scale by data enhancement means.
S105: and adding the new positive sample data and the new negative sample data to the original training data set to obtain a new training data set.
In the process of implementing S102, new positive sample data and new negative sample data may be added to the original training data set separately, or may be added to the original training data set after being mixed.
S106: model training is performed using the new training dataset.
In the scheme provided by the embodiments of the present invention, randomly shuffling and recombining the features within each sample makes the N features mutually independent while each still follows its normal distribution. On this basis, data enhancement can be applied to non-image, non-speech data, effectively enlarging the data set; when such data are used for training, model overfitting is effectively alleviated and the accuracy of model prediction is improved.
Based on the data training method disclosed in the embodiments of the present invention, an embodiment of the present invention correspondingly discloses a data training apparatus. Referring to fig. 2, a block diagram of a data training apparatus according to an embodiment of the present invention is shown.
The data training apparatus comprises: an acquisition module 201, a preprocessing module 202, a traversal feature module 203, a processing module 204, an adding module 205, and a training module 206.
The acquisition module 201 is configured to: sample data in the original training data set is obtained.
The preprocessing module 202 is configured to: and preprocessing the sample data to obtain positive sample data and negative sample data.
The traversal feature module 203 is configured to: traverse, for the positive sample data and the negative sample data respectively, all column features contained therein.
The processing module 204 is configured to: randomly shuffle each of the column features contained in the positive sample data and the negative sample data respectively, and recombine the shuffled columns to obtain new positive sample data and new negative sample data.
The adding module 205 is configured to: and adding the new positive sample data and the new negative sample data to the original training data set to obtain a new training data set.
The training module 206 is configured to: model training is performed using the new training dataset.
Optionally, the traversal feature module 203 is specifically configured to: traverse, for a first preset proportion of the positive sample data and a second preset proportion of the negative sample data respectively, all column features contained therein.
The first preset proportion indicates the ratio of the number of positive sample data used for traversal to the total number of positive sample data, and the second preset proportion indicates the ratio of the number of negative sample data used for traversal to the total number of negative sample data.
Alternatively, the traversal feature module 203 is specifically configured to: traverse, for positive sample data and negative sample data whose numbers satisfy a third preset proportion, all column features contained therein.
The third preset proportion indicates the ratio between the number of positive sample data used for traversal and the number of negative sample data used for traversal.
Alternatively, the traversal feature module 203 is specifically configured to: traverse, for all positive sample data and all negative sample data respectively, all column features contained therein.
For the specific implementation principle of each module in the data training apparatus disclosed in the above embodiment, refer to the corresponding content of the data training method disclosed in the above embodiment of the present invention; it is not described here again.
With the data training apparatus provided by the embodiment of the present invention, the acquisition module acquires sample data from an original training data set; the preprocessing module preprocesses the sample data to obtain positive sample data and negative sample data; the traversal feature module traverses, for the positive and negative sample data respectively, all column features contained therein; the processing module randomly shuffles each of the column features and recombines the shuffled columns to obtain new positive sample data and new negative sample data; the adding module adds the new positive and negative sample data to the original training data set to obtain a new training data set; and the training module performs model training using the new training data set. In the scheme provided by the embodiments of the present invention, randomly shuffling and recombining the features within each sample makes the N features mutually independent while each still follows its normal distribution. On this basis, data enhancement can be applied to non-image, non-speech data, effectively enlarging the data set; when such data are used for training, model overfitting is effectively alleviated and the accuracy of model prediction is improved.
Based on the data training method and apparatus disclosed in the embodiments of the present invention, an embodiment of the present invention further discloses a data training device. Referring to fig. 3, a block diagram of a data training device according to an embodiment of the present invention is shown.
The data training apparatus includes: a processor 301 and a memory 302.
A memory 302 for storing a computer program.
The processor 301 is configured to implement any of the data training methods disclosed above according to the embodiments of the present invention when invoking and executing the computer program stored in the memory 302.
Based on the data training method, the data training device and the data training equipment disclosed by the embodiment of the invention, the embodiment of the invention also discloses a storage medium.
The storage medium has stored therein computer executable instructions. When loaded and executed by a processor, the computer-executable instructions implement any of the data training methods disclosed above in accordance with embodiments of the present invention.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be referred to one another. Since the apparatus and device embodiments are substantially similar to the method embodiment, their description is relatively brief; refer to the description of the method embodiment for the relevant parts. The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement this without inventive effort.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of data training, the method comprising:
acquiring sample data in an original training data set;
preprocessing the sample data to obtain positive sample data and negative sample data;
traversing, for the positive sample data and the negative sample data respectively, all column features contained therein;
randomly shuffling each of the column features contained in the positive sample data and the negative sample data respectively, so that the features contained in the shuffled sample data are mutually independent, and recombining the shuffled columns to obtain new positive sample data and new negative sample data, wherein during the random shuffling, the features of the current column being shuffled are permuted among themselves and the shuffled features remain in the current column;
adding the new positive sample data and the new negative sample data to the original training data set to obtain a new training data set;
and performing model training by using the new training data set.
2. The method of claim 1, wherein traversing, for the positive sample data and the negative sample data respectively, all column features contained therein comprises:
traversing, for a first preset proportion of the positive sample data and a second preset proportion of the negative sample data respectively, all column features contained therein;
wherein the first preset proportion indicates the ratio of the number of positive sample data used for traversal to the total number of positive sample data, and the second preset proportion indicates the ratio of the number of negative sample data used for traversal to the total number of negative sample data.
3. The method of claim 1, wherein traversing, for the positive sample data and the negative sample data respectively, all column features contained therein comprises:
traversing, for positive sample data and negative sample data whose numbers satisfy a third preset proportion, all column features contained therein;
wherein the third preset proportion indicates the ratio between the number of positive sample data used for traversal and the number of negative sample data used for traversal.
4. The method of claim 1, wherein traversing, for the positive sample data and the negative sample data respectively, all column features contained therein comprises:
traversing, for all positive sample data and all negative sample data respectively, all column features contained therein.
5. A data training apparatus, the apparatus comprising:
the acquisition module is used for acquiring sample data in the original training data set;
the preprocessing module is used for preprocessing the sample data to obtain positive sample data and negative sample data;
a traversal feature module, configured to traverse, for the positive sample data and the negative sample data respectively, all column features contained therein;
a processing module, configured to randomly shuffle each of the column features contained in the positive sample data and the negative sample data respectively, so that the features contained in the shuffled sample data are mutually independent, and to recombine the shuffled columns to obtain new positive sample data and new negative sample data, wherein during the random shuffling, the features of the current column being shuffled are permuted among themselves and remain in the current column;
an adding module, configured to add the new positive sample data and the new negative sample data to the original training data set to obtain a new training data set;
and the training module is used for carrying out model training by utilizing the new training data set.
6. The apparatus of claim 5, wherein
the traversal feature module is specifically configured to traverse, for a first preset proportion of the positive sample data and a second preset proportion of the negative sample data respectively, all column features contained therein;
the first preset proportion indicates the ratio of the number of positive sample data used for traversal to the total number of positive sample data, and the second preset proportion indicates the ratio of the number of negative sample data used for traversal to the total number of negative sample data.
7. The apparatus of claim 5, wherein
the traversal feature module is specifically configured to traverse, for positive sample data and negative sample data whose numbers satisfy a third preset proportion, all column features contained therein;
wherein the third preset proportion indicates the ratio between the number of positive sample data used for traversal and the number of negative sample data used for traversal.
8. The apparatus of claim 5, wherein
the traversal feature module is specifically configured to traverse, for all positive sample data and all negative sample data respectively, all column features contained therein.
9. A data training device comprising a processor and a memory;
the memory is used for storing a computer program;
the processor being adapted to implement the method of any of claims 1 to 4 when invoking and executing a computer program stored in the memory.
10. A storage medium having stored therein computer executable instructions which, when loaded and executed by a processor, implement the method of any one of claims 1 to 4.
CN202011055438.1A; priority date 2020-09-29; filing date 2020-09-29; granted as CN112052915B (Active); title: Data training method, device, equipment and storage medium

Priority Applications (1)

Application: CN202011055438.1A; Publication: CN112052915B; Title: Data training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application: CN202011055438.1A; Publication: CN112052915B; Title: Data training method, device, equipment and storage medium

Publications (2)

Publication number | Publication date
CN112052915A | 2020-12-08
CN112052915B | 2024-02-13

Family

ID=73605073

Family Applications (1)

Application: CN202011055438.1A; Status: Active; Publication: CN112052915B; Title: Data training method, device, equipment and storage medium

Country Status (1)

Country: CN; Publication: CN112052915B

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2022252079A1 * | 2021-05-31 | 2022-12-08 | 京东方科技集团股份有限公司 | Data processing method and apparatus
CN113762423A | 2021-11-09 | 2021-12-07 | 北京世纪好未来教育科技有限公司 | Data processing and model training method and device, electronic equipment and storage medium

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN109887541A * | 2019-02-15 | 2019-06-14 | 张海平 | Target protein prediction method and system combined with small molecules
CN111275491A * | 2020-01-21 | 2020-06-12 | 深圳前海微众银行股份有限公司 | Data processing method and device

Family Cites Families (1)

Publication number | Priority date | Publication date | Assignee | Title
CN107798390B * | 2017-11-22 | 2023-03-21 | 创新先进技术有限公司 | Training method and device of machine learning model and electronic equipment


Also Published As

Publication number | Publication date
CN112052915A | 2020-12-08


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant