CN114021667A - Method and device for determining training data and electronic equipment - Google Patents


Info

Publication number
CN114021667A
Authority
CN
China
Prior art keywords
training data
abnormal
value
data
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111422756.1A
Other languages
Chinese (zh)
Inventor
申亚坤
赵辉
陶威
周慧婷
刘烨敏
谭莹坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202111422756.1A
Publication of CN114021667A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for determining training data, and electronic equipment, which can be applied to the field of artificial intelligence or the field of finance. In the method, features of first training data are replaced using replacement features to obtain second training data; a data adding operation is performed on the second training data to obtain third training data equal in quantity to the training data to be analyzed; a second residual value corresponding to the third training data is determined; and the first training data is used as the training data for model training when the difference between the first residual value and the second residual value is smaller than a preset threshold. Through these steps, normal first training data is screened out of the training data to be analyzed and used for model training, which helps ensure the accuracy of the training data and improves the accuracy of the neural network model trained on it.

Description

Method and device for determining training data and electronic equipment
Technical Field
The invention relates to the field of data processing, in particular to a method and a device for determining training data and electronic equipment.
Background
With the continuous development of technology, neural network models are used more and more widely. Before a neural network model is used, it must first be trained with training data. At present, during the preparation of training data, some users forge training data through data cheating, for example by modifying certain features of the training data, which lowers the accuracy of a neural network model trained on that data.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for determining training data, and an electronic device, so as to solve the problem that a neural network model trained based on forged training data has low accuracy.
In order to solve the technical problems, the invention adopts the following technical scheme:
a method of determining training data, comprising:
acquiring training data to be analyzed and a model prediction result of the training data to be analyzed;
carrying out anomaly analysis on the model prediction result to determine training data to be analyzed corresponding to the abnormal model prediction result, and using the training data as abnormal training data;
deleting the abnormal training data from the training data to be analyzed to obtain first training data;
determining a first association relationship between the feature values of the training data to be analyzed and the corresponding label values, and determining a second association relationship between the feature values of the first training data and the corresponding label values;
determining a replacement feature based on the first association relationship and the second association relationship;
replacing the feature of the first training data by using the replacement feature to obtain second training data, and determining a first residual value corresponding to the first training data and the second training data;
performing a data adding operation on the first training data to obtain third training data equal in quantity to the training data to be analyzed;
and determining a second residual value corresponding to the third training data, and using the first training data as the training data for model training when the difference between the first residual value and the second residual value is smaller than a preset threshold.
Further, performing anomaly analysis on the model prediction result to determine training data to be analyzed corresponding to the abnormal model prediction result, and using the training data as the abnormal training data, including:
classifying the model prediction result according to the label value to obtain a classification result;
taking, as first abnormal training data, the training data to be analyzed whose model prediction result in the classification result has a numerical value falling within a preset data interval;
acquiring user behavior data corresponding to a model prediction result, and performing clustering analysis on the user behavior data to obtain a clustering result;
determining abnormal users based on the clustering result, and taking training data to be analyzed corresponding to the abnormal users as second abnormal training data;
and combining the first abnormal training data and the second abnormal training data to obtain abnormal training data.
Further, based on the clustering result, determining abnormal users, including:
screening out users of which the user behavior data meet a preset behavior rule based on the clustering result, and taking the users as initial abnormal users;
taking the training data to be analyzed of the initial abnormal user as fourth training data, and taking the training data to be analyzed of the non-initial abnormal user as fifth training data;
calculating a third residual value of the feature value of the fifth training data;
performing feature replacement on the fourth training data to obtain sixth training data, and calculating a fourth residual value of a feature value of the sixth training data;
and under the condition that the third residual value and the fourth residual value meet a preset residual value rule, taking the initial abnormal user as an abnormal user.
Further, determining a replacement feature based on the first association relationship and the second association relationship includes:
taking, as a first replacement feature, the feature whose sequence number value changes the most between the first association relationship and the second association relationship;
randomly selecting a preset number of features from the features other than the one whose sequence number value changes the most between the first association relationship and the second association relationship, and taking them as second replacement features;
and taking the first replacement feature and the second replacement features as the replacement features.
Further, the method also includes:
when the difference between the first residual value and the second residual value is not less than the preset threshold, returning to the step of performing anomaly analysis on the model prediction result to determine the training data to be analyzed corresponding to the abnormal model prediction result and using it as abnormal training data, and executing the subsequent steps in sequence, until the difference between the first residual value and the second residual value is less than the preset threshold and the first training data is used as the training data for model training.
An apparatus for determining training data, comprising:
the data acquisition module is used for acquiring training data to be analyzed and a model prediction result of the training data to be analyzed;
the abnormal analysis module is used for performing abnormal analysis on the model prediction result so as to determine training data to be analyzed corresponding to the abnormal model prediction result and using the training data as abnormal training data;
the data deleting module is used for deleting the abnormal training data from the training data to be analyzed to obtain first training data;
the relation determining module is used for determining a first association relationship between the feature values of the training data to be analyzed and the corresponding label values and determining a second association relationship between the feature values of the first training data and the corresponding label values;
a feature determination module for determining a replacement feature based on the first association relationship and the second association relationship;
the replacing module is used for replacing the characteristics of the first training data by using the replacing characteristics to obtain second training data and determining a first residual value corresponding to the first training data and the second training data;
the data adding module is used for performing data adding operation on the first training data to obtain third training data with the same quantity as the training data to be analyzed;
and the data determining module is used for determining a second residual value corresponding to the third training data, and taking the first training data as training data used in model training under the condition that the difference value between the first residual value and the second residual value is smaller than a preset threshold value.
Further, the anomaly analysis module includes:
the classification submodule is used for classifying the model prediction result according to the label value to obtain a classification result;
the first data determination submodule is used for taking training data to be analyzed corresponding to the model prediction result of which the numerical value meets the range of a preset data interval in the classification result as first abnormal training data;
the clustering submodule is used for acquiring user behavior data corresponding to a model prediction result and carrying out clustering analysis on the user behavior data to obtain a clustering result;
the second data determining submodule is used for determining an abnormal user based on the clustering result and taking training data to be analyzed corresponding to the abnormal user as second abnormal training data;
and the third data determining submodule is used for combining the first abnormal training data and the second abnormal training data to obtain abnormal training data.
Further, the second data determination submodule includes:
the user screening unit is used for screening out users of which the user behavior data meet a preset behavior rule based on the clustering result and taking the users as initial abnormal users;
the data determining unit is used for taking the training data to be analyzed of the initial abnormal user as fourth training data and taking the training data to be analyzed of the non-initial abnormal user as fifth training data;
a first residual calculation unit configured to calculate a third residual value of the feature value of the fifth training data;
a second residual calculation unit, configured to perform feature replacement on the fourth training data to obtain sixth training data, and calculate a fourth residual of a feature value of the sixth training data;
and the user determining unit is used for taking the initial abnormal user as an abnormal user under the condition that the third residual value and the fourth residual value meet a preset residual value rule.
Further, the feature determination module includes:
a first feature determining unit, configured to take, as a first replacement feature, the feature whose sequence number value changes the most between the first association relationship and the second association relationship;
the feature screening unit is used for randomly selecting a preset number of features from the features other than the one whose sequence number value changes the most between the first association relationship and the second association relationship, and taking them as second replacement features;
a second feature determination unit configured to use the first replacement feature and the second replacement feature as replacement features.
An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls the program to perform any one of the above-described methods for determining training data.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a method and a device for determining training data, and electronic equipment. Training data to be analyzed and its model prediction result are acquired. Anomaly analysis is performed on the model prediction result to determine the training data to be analyzed that corresponds to an abnormal model prediction result, which is taken as abnormal training data. The abnormal training data is deleted from the training data to be analyzed to obtain first training data, and a first residual value corresponding to the first training data is determined. A first association relationship between the feature values of the training data to be analyzed and the corresponding label values is determined, as is a second association relationship between the feature values of the first training data and the corresponding label values. A replacement feature is determined based on the two association relationships and is used to replace features of the first training data, yielding second training data. A data adding operation is performed on the second training data to obtain third training data equal in quantity to the training data to be analyzed, and a second residual value corresponding to the third training data is determined. When the difference between the first residual value and the second residual value is smaller than a preset threshold, the first training data is used as the training data for model training. Through these steps, normal first training data is screened out of the training data to be analyzed and used for model training, which helps ensure the accuracy of the training data and improves the accuracy of the neural network model trained on it.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for determining training data according to an embodiment of the present invention;
fig. 2 is a flowchart of another method for determining training data according to an embodiment of the present invention;
fig. 3 is a flowchart of a method for determining training data according to another embodiment of the present invention;
fig. 4 is a flowchart of a method for determining training data according to another embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for determining training data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the continuous development of technology, neural network models are used more and more widely. Before a neural network model is used, it must first be trained with training data. At present, some users tamper with training data through data cheating during the preparation of training data. For example, internal users of model training, such as in-house employees, technical instructors and external assistance personnel, may interfere with the model training process in a white-box manner by modifying features of the training data. External users of the model, such as loan users and credit risk assessment users, may repeatedly feed inputs to the model and observe its outputs in a black-box manner, and thereby interfere with the model through data cheating such as adversarial example generation. Both approaches lower the accuracy of the trained neural network model, seriously affect the normal operation of the machine learning model, and cause abnormal business handling.
In order to solve the technical problem, the inventor finds that abnormal data can be identified from a large amount of training data, and the abnormal data is removed from the training data, so that the accuracy of the finally obtained training data can be ensured, and the accuracy of model training by using the training data is further improved.
Specifically, an embodiment of the present invention provides a method and a device for determining training data, and electronic equipment. Training data to be analyzed and its model prediction result are acquired. Anomaly analysis is performed on the model prediction result to determine the training data to be analyzed that corresponds to an abnormal model prediction result, which is taken as abnormal training data. The abnormal training data is deleted from the training data to be analyzed to obtain first training data, and a first residual value corresponding to the first training data is determined. A first association relationship between the feature values of the training data to be analyzed and the corresponding label values is determined, as is a second association relationship between the feature values of the first training data and the corresponding label values. A replacement feature is determined based on the two association relationships and is used to replace features of the first training data, yielding second training data. A data adding operation is performed on the second training data to obtain third training data equal in quantity to the training data to be analyzed, and a second residual value corresponding to the third training data is determined. When the difference between the first residual value and the second residual value is smaller than a preset threshold, the first training data is used as the training data for model training. Through these steps, normal first training data is screened out of the training data to be analyzed and used for model training, which helps ensure the accuracy of the training data and improves the accuracy of the neural network model trained on it.
It should be noted that the training data determination method, the training data determination device and the electronic device provided by the invention can be used in the field of artificial intelligence or the field of finance. The above description is only an example, and does not limit the application fields of the method, the apparatus, and the electronic device for determining training data provided by the present invention.
On the basis of the above, an embodiment of the present invention provides a method for determining training data, and with reference to fig. 1, the method may include:
S11, acquiring training data to be analyzed and a model prediction result of the training data to be analyzed.
In this embodiment, training data during model training is obtained, and training data with higher reliability needs to be screened out from the data subsequently, so that the training data obtained at this time is referred to as training data to be analyzed.
Each training data to be analyzed corresponds to a label value, for example, the training data to be analyzed is a picture, and the label value is a cat or a dog.
The invention can optimize the model either while the model is being trained or while the model is providing services externally. In either case, each training data item to be analyzed can be input into the model to obtain a model prediction result. The model prediction result may include a classification result and a numerical value, for example the class cat with a value of 90.
S12, performing anomaly analysis on the model prediction result to determine the training data to be analyzed corresponding to the abnormal model prediction result, and taking that training data as abnormal training data.
In this embodiment, abnormal training data to be analyzed is screened out based on the model prediction result, and is used as abnormal training data.
S13, deleting the abnormal training data from the training data to be analyzed to obtain first training data.
After the abnormal training data is determined, the abnormal training data is deleted from the training data to be analyzed, and the remaining data in the training data to be analyzed is called first training data, namely data which is preliminarily considered to be normal and not forged.
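Steps S12 and S13 can be illustrated with a minimal sketch that covers only the interval-based part of the anomaly analysis (the clustering of user behavior data is not shown). It assumes each sample carries a predicted label and a numerical score, and that a preset abnormal interval is configured per label. The function name `remove_interval_anomalies`, the sample fields and the interval values are hypothetical and are not taken from the source.

```python
def remove_interval_anomalies(samples, abnormal_intervals):
    """Split samples into (first_training_data, abnormal_training_data).

    A sample is treated as abnormal when its score lies inside the preset
    interval [low, high) configured for its label (assumed convention).
    """
    first_training_data, abnormal_training_data = [], []
    for sample in samples:
        interval = abnormal_intervals.get(sample["label"])
        is_abnormal = (
            interval is not None
            and interval[0] <= sample["score"] < interval[1]
        )
        if is_abnormal:
            abnormal_training_data.append(sample)
        else:
            first_training_data.append(sample)
    return first_training_data, abnormal_training_data


# Hypothetical prediction results grouped by label value.
samples = [
    {"id": 1, "label": "cat", "score": 90},
    {"id": 2, "label": "cat", "score": 12},
    {"id": 3, "label": "dog", "score": 75},
    {"id": 4, "label": "dog", "score": 30},
]
abnormal_intervals = {"cat": (0, 20), "dog": (0, 40)}  # illustrative values

first_training_data, abnormal_training_data = remove_interval_anomalies(
    samples, abnormal_intervals
)
```

Here the samples with ids 2 and 4 fall inside their label's interval and are removed, leaving ids 1 and 3 as the first training data.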
After the first training data is determined, the first training data may be verified through subsequent steps to determine that the obtained first training data is accurate.
The verification combines data redundancy and feature processing. Data redundancy up-samples the first training data to obtain a richer data set, from which a residual value M is calculated. Feature processing applies operations such as feature screening, filtering and feature combination to the data set, from which a residual value N is calculated. If the difference between the two residual values is small, the first training data is considered accurate. The verification process is described in detail below.
S14, determining a first association relationship between the feature values of the training data to be analyzed and the corresponding label values, and determining a second association relationship between the feature values of the first training data and the corresponding label values.
In this embodiment, each training data item to be analyzed has corresponding feature values. Taking a cat as an example, the feature values may correspond to parts such as the eyes and ears. Each training data item to be analyzed also has a label value, as described above.
Then, the correlation between the feature values of all the training data to be analyzed and the corresponding label values is calculated to determine how strongly each feature correlates with the label value. For example, the ear feature of the cat may have the highest correlation, the eye feature the second highest, ……, and the tail feature the lowest. The first association relationship is the resulting ranking of the features by their correlation with the label value. Suppose the first association relationship is:
1-ear, 2-eye … … 5-tail.
For the first training data, a second association relationship between the feature values of the first training data and the corresponding label values is calculated in the same way. Suppose the second association relationship is:
1-tail, 2-ear, 3-eye … ….
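The ranking described in S14 can be sketched as follows. The source does not name a correlation measure, so this sketch assumes the absolute Pearson correlation between each feature column and numerically encoded labels; the helper names `pearson` and `rank_features` and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_features(rows, labels, feature_names):
    """Return {feature: sequence number}; 1 = strongest |correlation|."""
    scores = {}
    for index, name in enumerate(feature_names):
        column = [row[index] for row in rows]
        scores[name] = abs(pearson(column, labels))
    ordered = sorted(feature_names, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Toy data: three features per sample, labels encoded as 0/1 (illustrative).
rows = [[1.0, 5.0, 2.0], [2.0, 3.0, 2.5], [3.0, 6.0, 1.0], [4.0, 4.0, 3.0]]
labels = [0, 0, 1, 1]
ranking = rank_features(rows, labels, ["ear", "eye", "tail"])
```

Running the same function once on the training data to be analyzed and once on the first training data would give the two association relationships that the later steps compare.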
S15, determining a replacement feature based on the first association relationship and the second association relationship.
Specifically, referring to fig. 2, step S15 may include:
S21, taking, as a first replacement feature, the feature whose sequence number value changes the most between the first association relationship and the second association relationship.
Specifically, the feature sequence number values are the numbers 1, 2, ……, 5 in the association relationships above. For each feature, the difference between its sequence number value in the first association relationship and its sequence number value in the second association relationship is calculated and the absolute value is taken; the feature with the largest absolute value is taken as the first replacement feature.
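The selection in S21 can be sketched as below. The example uses a shortened three-feature variant of the rankings rather than the five-feature example in the text, whose middle entries are elided; the function name `first_replacement_feature` is hypothetical.

```python
def first_replacement_feature(first_ranking, second_ranking):
    """Return the feature with the largest absolute change in sequence number.

    Both arguments map feature name -> sequence number (1 = most correlated).
    On ties, the feature that appears first in first_ranking is returned.
    """
    changes = {
        name: abs(first_ranking[name] - second_ranking[name])
        for name in first_ranking
        if name in second_ranking
    }
    best = max(changes, key=changes.get)
    return best, changes


first_ranking = {"ear": 1, "eye": 2, "tail": 3}
second_ranking = {"tail": 1, "ear": 2, "eye": 3}
feature, changes = first_replacement_feature(first_ranking, second_ranking)
```

In this variant the tail feature moves from position 3 to position 1, a change of 2, while the other features move by 1, so the tail feature is selected.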
S22, randomly selecting a preset number of features from the features other than the one whose sequence number value changes the most between the first association relationship and the second association relationship, and taking them as second replacement features.
After the first replacement feature is determined, a preset number of features, for example 2, are selected from the remaining features as the second replacement features.
S23, taking the first replacement feature and the second replacement features as the replacement features.
In this embodiment, both a first replacement feature and second replacement features are selected because some features are combinations of multiple sub-features; for example, the tail feature may combine sub-features such as hair color, tail length and hair gloss. Processing features in this way allows features to be combined and filtered, makes it harder for cheaters to forge data for individual features, and improves the detection capability of the scheme.
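Steps S22 and S23 can be sketched as a random draw from the remaining features followed by a merge. The function name `choose_replacement_features`, the `seed` parameter (added only to make the draw reproducible) and the feature names `nose` and `paw` are hypothetical.

```python
import random


def choose_replacement_features(first_feature, all_features, count, seed=None):
    """Return [first_feature] plus `count` features drawn at random
    (without repetition) from the features other than first_feature."""
    rng = random.Random(seed)
    remaining = [name for name in all_features if name != first_feature]
    second_features = rng.sample(remaining, min(count, len(remaining)))
    return [first_feature] + second_features


all_features = ["ear", "eye", "nose", "paw", "tail"]  # illustrative names
replacement_features = choose_replacement_features(
    "tail", all_features, count=2, seed=0
)
```

The result always starts with the first replacement feature and never repeats a feature, whichever two remaining features the random draw happens to pick.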
S16, replacing the feature of the first training data by using the replacement feature to obtain second training data, and determining a first residual value corresponding to the first training data and the second training data.
Specifically, the replacement features in the first training data are replaced to obtain second training data; a residual of the feature values of the first training data and a residual of the feature values of the second training data are calculated; and the weighted sum of the two residuals is taken as the first residual value.
The first residual value represents the residual level of the normal samples. After abnormal records are filtered out by statistical methods, the characteristics of the normal samples are initially preserved, and calculating the residual value gives the deviation level of the samples once the abnormal values are removed.
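The weighted sum in S16 can be sketched under several loudly stated assumptions, since the source defines neither the residual formula, nor the replacement value, nor the weights. Here the residual is assumed to be the mean squared deviation of feature values from their column means, a replaced feature is assumed to be overwritten with its column mean, and both weights are assumed to be 0.5. All function names are hypothetical.

```python
def column_means(rows):
    """Mean of each feature column."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def residual(rows):
    """Assumed residual: mean squared deviation from the column means."""
    means = column_means(rows)
    total = sum(
        (value - mean) ** 2 for row in rows for value, mean in zip(row, means)
    )
    return total / (len(rows) * len(means))


def replace_features(rows, feature_indices):
    """Assumed replacement: overwrite the chosen columns with their mean."""
    means = column_means(rows)
    chosen = set(feature_indices)
    return [
        [means[j] if j in chosen else value for j, value in enumerate(row)]
        for row in rows
    ]


def first_residual_value(first_rows, feature_indices, w_first=0.5, w_second=0.5):
    """Weighted sum of the residuals of the first and second training data."""
    second_rows = replace_features(first_rows, feature_indices)
    return w_first * residual(first_rows) + w_second * residual(second_rows)


first_rows = [[1.0, 2.0], [3.0, 6.0]]
value = first_residual_value(first_rows, feature_indices=[1])
```

With these toy rows the first training data has a residual of 2.5, the second training data (column 1 replaced by its mean) has a residual of 0.5, and the weighted sum is 1.5.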
In addition to the above feature replacement, a feature filtering operation may be performed, and at this time, the replaced features in the first training data may be directly filtered to obtain the second training data.
In particular, data cheaters may know the details of the model well and may therefore forge features that the model relies on. The detection scheme therefore performs feature processing on the samples submitted by users and calculates the residual level of the samples multiple times through feature combination and feature filtering, which helps find out which features were forged by the customer.
S17, performing a data adding operation on the first training data to obtain third training data equal in quantity to the training data to be analyzed.
Specifically, suppose there are 1000 pieces of training data to be analyzed and 900 pieces of first training data remain after the abnormal training data is removed. To restore the same quantity as the training data to be analyzed, this embodiment performs a data adding operation on the first training data: the data distribution of the first training data is calculated, 100 new pieces of data are generated according to that distribution, and the new data is combined with the first training data to obtain 1000 pieces of data, referred to as the third training data.
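The data adding operation in S17 can be sketched as follows. The source only says that new data is generated according to the data distribution of the first training data; this sketch assumes an independent normal distribution per feature, fitted from the column mean and standard deviation. The function name `add_data` and the synthetic input rows are hypothetical.

```python
import random
import statistics


def add_data(first_rows, target_count, seed=None):
    """Pad first_rows up to target_count rows with synthetic samples.

    Assumption: each feature is modelled as an independent Gaussian whose
    mean and standard deviation are estimated from first_rows.
    """
    rng = random.Random(seed)
    missing = target_count - len(first_rows)
    original = [list(row) for row in first_rows]
    if missing <= 0:
        return original
    columns = list(zip(*first_rows))
    params = [(statistics.fmean(col), statistics.pstdev(col)) for col in columns]
    generated = [
        [rng.gauss(mean, spread) for mean, spread in params]
        for _ in range(missing)
    ]
    return original + generated


# 900 rows stand in for the first training data left after anomaly removal.
source_rng = random.Random(1)
first_rows = [
    [source_rng.uniform(0, 1), source_rng.uniform(10, 20)] for _ in range(900)
]
third_rows = add_data(first_rows, target_count=1000, seed=2)
```

The first 900 rows of the result are the unchanged first training data and the last 100 are generated, matching the 900 plus 100 example in the text.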
S18, determining a second residual value corresponding to the third training data, and using the first training data as the training data for model training when the difference between the first residual value and the second residual value is smaller than a preset threshold.
Specifically, a second residual value of the feature values of the third training data is calculated, and the difference between the first residual value and the second residual value is compared with a preset threshold. If the difference is smaller than the preset threshold, the third training data obtained after data addition can still be regarded as normal data; that is, the first training data contained in the third training data is further verified to be normal, which improves the reliability of the normal-data screening.
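The decision in S18 reduces to a threshold comparison, sketched below. Taking the absolute value of the difference is an assumption, since the source only speaks of the difference value; the function name `accept_first_training_data` and the numbers are hypothetical.

```python
def accept_first_training_data(first_residual, second_residual, threshold):
    """Return True when the two residual values differ by less than threshold.

    Assumption: the difference is taken as an absolute value.
    """
    return abs(first_residual - second_residual) < threshold


accepted = accept_first_training_data(1.5, 1.6, threshold=0.2)
rejected = accept_first_training_data(1.5, 2.0, threshold=0.2)
```

When the check fails, the disclosure describes returning to the anomaly analysis step and repeating the subsequent steps until the check passes.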
Thereafter, model training and optimization operations are performed using the first training data.
In this embodiment, training data to be analyzed and a model prediction result of the training data to be analyzed are obtained. Anomaly analysis is performed on the model prediction result to determine the training data to be analyzed that corresponds to an abnormal model prediction result, and this data is used as abnormal training data. The abnormal training data is deleted from the training data to be analyzed to obtain first training data. A first association relationship between the feature values of the training data to be analyzed and the corresponding label values is determined, and a second association relationship between the feature values of the first training data and the corresponding label values is determined. A replacement feature is determined based on the first association relationship and the second association relationship, the features of the first training data are replaced using the replacement feature to obtain second training data, and a first residual value corresponding to the first training data and the second training data is determined. A data adding operation is performed on the first training data to obtain third training data equal in quantity to the training data to be analyzed, a second residual value corresponding to the third training data is determined, and the first training data is used as the training data for model training when the difference between the first residual value and the second residual value is smaller than a preset threshold. Through these steps, normal first training data is screened out of the training data to be analyzed and used as the training data for model training, which ensures the accuracy of the model training data and improves the accuracy of the neural network model trained on it.
In addition, the method can identify machine learning model security problems caused by various black-box and white-box attacks by developers and by internal and external clients, by multi-modal data cheating, and the like. During implementation, residual analysis and feature replacement uncover as much of an attacker's replaced data and abnormal data as possible, so that abnormal users can be accurately identified. This helps find model problems in advance and avoids unnecessary economic loss.
On the basis of the above, referring to fig. 3, step S12 "performing anomaly analysis on the model prediction result to determine training data to be analyzed corresponding to the abnormal model prediction result, and using the training data as the abnormal training data" may include:
and S31, classifying the model prediction result according to the label value to obtain a classification result.
The label value in this embodiment has been introduced in the above embodiment. Here, the model prediction results are classified into at least one category according to the label value (for example, cat or dog) to obtain the classification result.
And S32, taking the training data to be analyzed corresponding to the model prediction result with the numerical value meeting the preset data interval range in the classification result as first abnormal training data.
In this embodiment, the training data to be analyzed corresponding to a model prediction result whose value in the classification result is either large (e.g., greater than 96) or small (e.g., less than 10) is regarded as abnormal data and is referred to as the first abnormal training data.
Data with such large or small values is regarded as abnormal for the following reason:
for samples drawn from a normal data set, the sample distribution is consistent, so the prediction results differ little after upsampling or feature engineering, and their fluctuation is small. Therefore, data whose value calculated by the method is greater than a system-specified threshold P or less than a system-specified threshold Q, where P > Q, can be determined to be high-risk abnormal data, which can then be confirmed manually.
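Steps S31 and S32 can be sketched as grouping prediction results by label and flagging values outside the interval [Q, P]. The default thresholds reuse the example values 96 and 10 from the text; the tuple layout `(sample_id, label, score)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def first_abnormal_ids(predictions, upper_p=96.0, lower_q=10.0):
    """Group predictions by label, then flag scores above P or below Q.

    predictions: iterable of (sample_id, label, score) tuples (assumed layout).
    Returns a dict mapping each label to the flagged sample ids.
    """
    if not upper_p > lower_q:
        raise ValueError("P must be greater than Q")
    by_label = defaultdict(list)
    for sample_id, label, score in predictions:
        by_label[label].append((sample_id, score))
    return {
        label: [sid for sid, score in items if score > upper_p or score < lower_q]
        for label, items in by_label.items()
    }

predictions = [
    ("a", "cat", 97.5),
    ("b", "cat", 55.0),
    ("c", "dog", 4.2),
    ("d", "dog", 60.0),
]
flagged = first_abnormal_ids(predictions)
```

Samples flagged this way are only candidates; as the text notes, high-risk items can still be confirmed manually.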
And S32, obtaining user behavior data corresponding to the model prediction result, and carrying out clustering analysis on the user behavior data to obtain a clustering result.
In this embodiment, the user behavior data mainly includes the usage time, the usage count, the usage period, and the like.
The usage count is the number of consecutive uses in one session. For example, if a user starts using the model at 8 a.m. and uses it three times in a row, the usage count is 3 and the usage time is 8 a.m.; if the user does this once every two weeks, the usage period is two weeks.
An external user may use data cheating to deceive the model for some illegal purpose, maliciously guiding an artificial intelligence model into making wrong predictions. A typical financial application scenario is a bank's automatic credit granting model: if external customers participate in the construction of the model, they can add a specially constructed data set to the normal sample set, or reconstruct the normal sample set by adding forged features, and thereby exploit the defects of the model to obtain illegitimate predictions and achieve illegal credit granting. Therefore, in this embodiment, the user behavior data is analyzed to find users with an abnormal operation frequency, whose data is likely to be abnormal training data.
After the user behavior data is obtained, cluster analysis is performed on it to obtain a clustering result. For example, the clustering result may show that most users use the model between 9:00 and 13:00, once each time.
And S33, determining abnormal users based on the clustering result, and taking the training data to be analyzed corresponding to the abnormal users as second abnormal training data.
In this embodiment, after the clustering result is obtained, a user whose behavior data differs greatly from the clustering result is regarded as an abnormal user.
Specifically, referring to fig. 4, determining an abnormal user based on the clustering result may include:
and S41, screening out users with the user behavior data meeting the preset behavior rules based on the clustering result, and taking the users as initial abnormal users.
In this embodiment, the preset behavior rule may be that at least one of the usage time, the usage count, and the usage period exceeds a set value; that is, the user's usage habits differ from those of most users. The usage habits of most users can be determined from the clustering result. For example, if most users use the model between 9:00 and 13:00 while some users use it from 21:00 to 3:00 the next day, those users differ from the majority and are regarded as initial abnormal users.
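The clustering and screening in S41 can be sketched with a small k-means written from scratch. Several choices here are assumptions rather than part of the described method: the behavior vector is `(usage hour, usage count)`, hours are treated as linear rather than circular, k is 2, and the "preset behavior rule" is simplified to "the user does not belong to the largest cluster".

```python
import math
from collections import Counter

def kmeans_labels(points, k=2, iterations=20):
    """Tiny k-means with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign():
        # Index of the nearest center for every point.
        return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

    for _ in range(iterations):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign()

def initial_abnormal_users(behavior, k=2):
    """Users outside the dominant cluster are treated as initial abnormal users."""
    users = list(behavior)
    labels = kmeans_labels([behavior[u] for u in users], k)
    dominant = Counter(labels).most_common(1)[0][0]
    return [u for u, lab in zip(users, labels) if lab != dominant]

# (usage hour, usage count) per user; u5 uses the model at 22:00, six times in a row.
behavior = {"u1": (9.0, 1), "u2": (10.5, 1), "u3": (12.0, 1), "u4": (11.0, 2), "u5": (22.0, 6)}
suspects = initial_abnormal_users(behavior)
```

In practice the rule could instead compare each behavior value against a set value, as the text describes; the cluster-membership test is just one simple stand-in.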
And S42, taking the training data to be analyzed of the initial abnormal user as fourth training data, and taking the training data to be analyzed of the non-initial abnormal user as fifth training data.
At this time, the data of the initial abnormal user and the data of the non-initial abnormal user are divided into the fourth training data and the fifth training data, respectively.
And S43, calculating a third residual value of the feature value of the fifth training data.
For the specific residual calculation process, refer to the corresponding description above.
And S44, performing feature replacement on the fourth training data to obtain sixth training data, and calculating a fourth residual value of the feature value of the sixth training data.
For the feature replacement and residual calculation processes, refer to the corresponding descriptions above; feature filtering may also be used in place of feature replacement.
S45, taking the initial abnormal user as an abnormal user under the condition that the third residual value and the fourth residual value meet a preset residual value rule.
Specifically, if the difference between the third residual value of the normal users and the fourth residual value of the initial abnormal users is larger than a set threshold, the initial abnormal users are considered to be genuinely different from the normal users and are taken as abnormal users.
Further, the higher the fourth residual value, the shorter the usage period, and the higher the usage count, the more likely it is that the user is abnormal.
And after the abnormal user is determined, taking the training data to be analyzed corresponding to the abnormal user as second abnormal training data.
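Steps S42 to S45 can be sketched as below. The residual measure and the feature replacement are passed in as callables because their definitions are given elsewhere in the description; the toy `residual_fn`, `replace_fn` and fixed `predict` function in the example are hypothetical placeholders, and the "preset residual value rule" is assumed to be an absolute gap above a threshold.

```python
def confirm_abnormal_users(data_by_user, initial_abnormal, residual_fn, replace_fn, threshold):
    """Confirm initial abnormal users by comparing residual levels.

    data_by_user: dict mapping user id to a list of rows.
    residual_fn / replace_fn: caller-supplied residual and feature-replacement steps.
    """
    fourth = [row for u in initial_abnormal for row in data_by_user[u]]
    fifth = [row for u, rows in data_by_user.items()
             if u not in initial_abnormal for row in rows]
    third_residual = residual_fn(fifth)          # normal users
    sixth = replace_fn(fourth)                   # feature replacement on suspects
    fourth_residual = residual_fn(sixth)
    if abs(fourth_residual - third_residual) > threshold:
        return set(initial_abnormal)
    return set()

# Toy setup: rows are (feature, label) and the "model" is a fixed function.
predict = lambda x: 2 * x
residual_fn = lambda rows: sum(abs(y - predict(x)) for x, y in rows) / len(rows)
replace_fn = lambda rows: [(1.5, y) for _, y in rows]  # overwrite the feature with a fixed value

data_by_user = {
    "u1": [(1.0, 2.0), (2.0, 4.1)],
    "u2": [(3.0, 6.0)],
    "u5": [(1.0, 9.0), (2.0, 9.5)],
}
abnormal_users = confirm_abnormal_users(data_by_user, ["u5"], residual_fn, replace_fn, threshold=1.0)
```

Here the normal users have a residual close to zero while the suspect's rows have a residual above 6 after replacement, so the suspect is confirmed as abnormal.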
And S34, combining the first abnormal training data and the second abnormal training data to obtain abnormal training data.
In this embodiment, the first abnormal training data and the second abnormal training data are combined to obtain the abnormal training data.
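A minimal sketch of the combination step, together with the later deletion of the abnormal data, is shown here. It assumes that samples are identified by ids and that a sample flagged by both analyses is counted once; both function names are hypothetical.

```python
def merge_abnormal(first_abnormal_ids, second_abnormal_ids):
    """Union of the two abnormal sets; duplicates collapse automatically."""
    return set(first_abnormal_ids) | set(second_abnormal_ids)

def remove_abnormal(training_data, abnormal_ids):
    """Drop abnormal samples to obtain the first training data."""
    return {sid: row for sid, row in training_data.items() if sid not in abnormal_ids}

training_data = {"a": (97.5,), "b": (55.0,), "c": (4.2,), "d": (60.0,)}
abnormal = merge_abnormal(["a", "c"], ["c", "d"])
first_training_data = remove_abnormal(training_data, abnormal)
```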
On the basis of this embodiment, when the difference between the first residual value and the second residual value is not less than the preset threshold, the preset behavior rule may be adjusted to re-determine the initial abnormal users and the abnormal users. The step of "performing anomaly analysis on the model prediction result to determine the training data to be analyzed corresponding to the abnormal model prediction result, and using the training data as the abnormal training data" is then executed again, followed by the subsequent steps in sequence. The process stops once the difference between the first residual value and the second residual value is less than the preset threshold and the first training data is used as the training data for model training.
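The retry loop described above can be sketched as follows. The screening pipeline and the rule adjustment are passed in as callables; `max_rounds` is an added safeguard that the text does not mention, and the toy screening function in the example is purely hypothetical.

```python
def screen_until_stable(run_screening, adjust_rule, rule, threshold, max_rounds=10):
    """Re-run the screening with an adjusted behavior rule until the residuals agree.

    run_screening(rule) must return (first_training_data, first_residual, second_residual).
    max_rounds is an assumption; the described method sets no upper bound.
    """
    for round_no in range(1, max_rounds + 1):
        first_data, first_residual, second_residual = run_screening(rule)
        if abs(first_residual - second_residual) < threshold:
            return first_data, rule, round_no
        rule = adjust_rule(rule)
    raise RuntimeError("no behavior rule satisfied the residual threshold")

# Toy pipeline: the rule is a maximum usage count; a stricter rule narrows the residual gap.
def toy_screening(max_uses):
    return [f"rows with uses <= {max_uses}"], 0.10, 0.10 + 0.02 * max_uses

data, final_rule, rounds = screen_until_stable(
    toy_screening, adjust_rule=lambda r: r - 1, rule=5, threshold=0.05
)
```

In the toy run the rule is tightened from 5 to 2 before the residual gap drops below the threshold.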
This embodiment provides a process for determining abnormal training data. Through this process the abnormal training data can be determined and eliminated, so that normal training data is obtained and the accuracy of the trained model is ensured.
Optionally, on the basis of the above embodiment of the method for determining training data, another embodiment of the present invention provides a device for determining training data. With reference to fig. 5, the device may include:
the data acquisition module 11 is configured to acquire training data to be analyzed and a model prediction result of the training data to be analyzed;
the anomaly analysis module 12 is configured to perform anomaly analysis on the model prediction result to determine training data to be analyzed corresponding to the abnormal model prediction result, and use the training data as the abnormal training data;
the data deleting module 13 is configured to delete the abnormal training data from the training data to be analyzed to obtain first training data;
a relationship determining module 14, configured to determine a first association relationship between the feature value of the training data to be analyzed and the corresponding tag value, and determine a second association relationship between the feature value of the first training data and the corresponding tag value;
a feature determination module 15, configured to determine a replacement feature based on the first association relation and the second association relation;
a replacing module 16, configured to replace, by using the replacement feature, the feature of the first training data to obtain second training data, and determine a first residual value corresponding to the first training data and the second training data;
a data adding module 17, configured to perform data adding operation on the first training data to obtain third training data with the same number as that of the training data to be analyzed;
a data determining module 18, configured to determine a second residual value corresponding to the third training data, and use the first training data as training data used in model training when a difference value between the first residual value and the second residual value is smaller than a preset threshold.
Further, the anomaly analysis module includes:
the classification submodule is used for classifying the model prediction result according to the label value to obtain a classification result;
the first data determination submodule is used for taking training data to be analyzed corresponding to the model prediction result of which the numerical value meets the range of a preset data interval in the classification result as first abnormal training data;
the clustering submodule is used for acquiring user behavior data corresponding to a model prediction result and carrying out clustering analysis on the user behavior data to obtain a clustering result;
the second data determining submodule is used for determining an abnormal user based on the clustering result and taking training data to be analyzed corresponding to the abnormal user as second abnormal training data;
and the third data determining submodule is used for combining the first abnormal training data and the second abnormal training data to obtain abnormal training data.
Further, the second data determination submodule includes:
the user screening unit is used for screening out users of which the user behavior data meet a preset behavior rule based on the clustering result and taking the users as initial abnormal users;
the data determining unit is used for taking the training data to be analyzed of the initial abnormal user as fourth training data and taking the training data to be analyzed of the non-initial abnormal user as fifth training data;
a first residual calculation unit configured to calculate a third residual value of the feature value of the fifth training data;
a second residual calculation unit, configured to perform feature replacement on the fourth training data to obtain sixth training data, and calculate a fourth residual of a feature value of the sixth training data;
and the user determining unit is used for taking the initial abnormal user as an abnormal user under the condition that the third residual value and the fourth residual value meet a preset residual value rule.
Further, the feature determination module includes:
a first feature determining unit, configured to use, as a first replacement feature, the feature whose sequence number value changes the most between the first association relationship and the second association relationship;
a feature screening unit, configured to randomly screen out a preset number of features from the features other than the one whose sequence number value changes the most between the first association relationship and the second association relationship, and use them as second replacement features;
a second feature determination unit configured to use the first replacement feature and the second replacement feature as replacement features.
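The selection of replacement features can be illustrated with the sketch below. It makes several assumptions that this passage does not confirm: the association relationship is measured by the absolute Pearson correlation between a feature and the label, the "sequence number value" is read as the feature's rank under that measure, and the second replacement features are drawn from the features other than the first one. All names and the toy data are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def rank_features(columns, labels):
    """Map feature name to rank (0 = strongest assumed association with the label)."""
    strength = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    ordered = sorted(strength, key=strength.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered)}

def choose_replacement_features(full_cols, full_labels, first_cols, first_labels,
                                n_random=1, seed=0):
    before = rank_features(full_cols, full_labels)    # first association relationship
    after = rank_features(first_cols, first_labels)   # second association relationship
    change = {name: abs(before[name] - after[name]) for name in before}
    first_replacement = max(change, key=change.get)   # largest rank change
    others = [name for name in change if name != first_replacement]
    second = random.Random(seed).sample(others, min(n_random, len(others)))
    return [first_replacement] + second

full_cols = {"income": [1, 2, 3, 4, 5, 6], "age": [3, 1, 4, 1, 5, 9], "clicks": [6, 5, 4, 3, 9, 9]}
full_labels = [1, 2, 3, 4, 5, 6]
first_cols = {name: col[:4] for name, col in full_cols.items()}  # abnormal rows removed
chosen = choose_replacement_features(full_cols, full_labels, first_cols, full_labels[:4])
```

A fixed seed keeps the random part of the selection reproducible, which is convenient when the residuals are recomputed over several rounds.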
Further, the anomaly analysis module 12 is further configured to:
and under the condition that the difference value between the first residual value and the second residual value is not smaller than the preset threshold value, perform the anomaly analysis on the model prediction result again to determine the training data to be analyzed corresponding to the abnormal model prediction result, and use the training data as abnormal training data.
In this embodiment, training data to be analyzed and a model prediction result of the training data to be analyzed are obtained. Anomaly analysis is performed on the model prediction result to determine the training data to be analyzed that corresponds to an abnormal model prediction result, and this data is used as abnormal training data. The abnormal training data is deleted from the training data to be analyzed to obtain first training data. A first association relationship between the feature values of the training data to be analyzed and the corresponding label values is determined, and a second association relationship between the feature values of the first training data and the corresponding label values is determined. A replacement feature is determined based on the first association relationship and the second association relationship, the features of the first training data are replaced using the replacement feature to obtain second training data, and a first residual value corresponding to the first training data and the second training data is determined. A data adding operation is performed on the first training data to obtain third training data equal in quantity to the training data to be analyzed, a second residual value corresponding to the third training data is determined, and the first training data is used as the training data for model training when the difference between the first residual value and the second residual value is smaller than a preset threshold. Through these steps, normal first training data is screened out of the training data to be analyzed and used as the training data for model training, which ensures the accuracy of the model training data and improves the accuracy of the neural network model trained on it.
It should be noted that, for the working processes of each module, sub-module, and unit in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
Optionally, on the basis of the embodiments of the method and the apparatus for determining training data, another embodiment of the present invention provides an electronic device, including: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls the program to perform the above-described method for determining training data.
In this embodiment, training data to be analyzed and a model prediction result of the training data to be analyzed are obtained. Anomaly analysis is performed on the model prediction result to determine the training data to be analyzed that corresponds to an abnormal model prediction result, and this data is used as abnormal training data. The abnormal training data is deleted from the training data to be analyzed to obtain first training data. A first association relationship between the feature values of the training data to be analyzed and the corresponding label values is determined, and a second association relationship between the feature values of the first training data and the corresponding label values is determined. A replacement feature is determined based on the first association relationship and the second association relationship, the features of the first training data are replaced using the replacement feature to obtain second training data, and a first residual value corresponding to the first training data and the second training data is determined. A data adding operation is performed on the first training data to obtain third training data equal in quantity to the training data to be analyzed, a second residual value corresponding to the third training data is determined, and the first training data is used as the training data for model training when the difference between the first residual value and the second residual value is smaller than a preset threshold. Through these steps, normal first training data is screened out of the training data to be analyzed and used as the training data for model training, which ensures the accuracy of the model training data and improves the accuracy of the neural network model trained on it.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for determining training data, comprising:
acquiring training data to be analyzed and a model prediction result of the training data to be analyzed;
carrying out anomaly analysis on the model prediction result to determine training data to be analyzed corresponding to the abnormal model prediction result, and using the training data as abnormal training data;
deleting the abnormal training data from the training data to be analyzed to obtain first training data;
determining a first association relationship between the feature value of the training data to be analyzed and the corresponding label value, and determining a second association relationship between the feature value of the first training data and the corresponding label value;
determining a replacement feature based on the first association relationship and the second association relationship;
replacing the feature of the first training data by using the replacement feature to obtain second training data, and determining a first residual value corresponding to the first training data and the second training data;
performing data adding operation on the first training data to obtain third training data with the same quantity as the training data to be analyzed;
and determining a second residual value corresponding to the third training data, and taking the first training data as training data used in model training under the condition that the difference value between the first residual value and the second residual value is smaller than a preset threshold value.
2. The method for determining training data according to claim 1, wherein performing anomaly analysis on the model prediction result to determine training data to be analyzed corresponding to the abnormal model prediction result, and using the training data as the abnormal training data includes:
classifying the model prediction result according to the label value to obtain a classification result;
taking training data to be analyzed corresponding to the model prediction result of which the numerical value meets the range of a preset data interval in the classification result as first abnormal training data;
acquiring user behavior data corresponding to a model prediction result, and performing clustering analysis on the user behavior data to obtain a clustering result;
determining abnormal users based on the clustering result, and taking training data to be analyzed corresponding to the abnormal users as second abnormal training data;
and combining the first abnormal training data and the second abnormal training data to obtain abnormal training data.
3. The method for determining training data according to claim 2, wherein determining an abnormal user based on the clustering result includes:
screening out users of which the user behavior data meet a preset behavior rule based on the clustering result, and taking the users as initial abnormal users;
taking the training data to be analyzed of the initial abnormal user as fourth training data, and taking the training data to be analyzed of the non-initial abnormal user as fifth training data;
calculating a third residual value of the feature value of the fifth training data;
performing feature replacement on the fourth training data to obtain sixth training data, and calculating a fourth residual value of a feature value of the sixth training data;
and under the condition that the third residual value and the fourth residual value meet a preset residual value rule, taking the initial abnormal user as an abnormal user.
4. The method of determining training data according to claim 2, wherein determining a replacement feature based on the first association relationship and the second association relationship comprises:
taking the feature whose sequence number value changes the most between the first association relationship and the second association relationship as a first replacement feature;
randomly screening out a preset number of features from the features other than the one whose sequence number value changes the most between the first association relationship and the second association relationship, and taking them as second replacement features;
and taking the first replacement feature and the second replacement feature as replacement features.
5. The method of determining training data according to claim 2, further comprising:
when the difference value between the first residual value and the second residual value is not less than the preset threshold value, returning to the step of performing anomaly analysis on the model prediction result to determine training data to be analyzed corresponding to the abnormal model prediction result and using the training data as abnormal training data, and executing the subsequent steps in sequence, until the difference value between the first residual value and the second residual value is less than the preset threshold value and the first training data is used as training data used in model training.
6. An apparatus for determining training data, comprising:
the data acquisition module is used for acquiring training data to be analyzed and a model prediction result of the training data to be analyzed;
the abnormal analysis module is used for performing abnormal analysis on the model prediction result so as to determine training data to be analyzed corresponding to the abnormal model prediction result and using the training data as abnormal training data;
the data deleting module is used for deleting the abnormal training data from the training data to be analyzed to obtain first training data;
the relationship determining module is used for determining a first association relationship between the feature value of the training data to be analyzed and the corresponding label value and determining a second association relationship between the feature value of the first training data and the corresponding label value;
a feature determination module for determining a replacement feature based on the first association relationship and the second association relationship;
the replacing module is used for replacing the characteristics of the first training data by using the replacing characteristics to obtain second training data and determining a first residual value corresponding to the first training data and the second training data;
the data adding module is used for performing data adding operation on the first training data to obtain third training data with the same quantity as the training data to be analyzed;
and the data determining module is used for determining a second residual value corresponding to the third training data, and taking the first training data as training data used in model training under the condition that the difference value between the first residual value and the second residual value is smaller than a preset threshold value.
7. The apparatus for determining training data according to claim 6, wherein the abnormality analysis module includes:
the classification submodule is used for classifying the model prediction result according to the label value to obtain a classification result;
the first data determination submodule is used for taking training data to be analyzed corresponding to the model prediction result of which the numerical value meets the range of a preset data interval in the classification result as first abnormal training data;
the clustering submodule is used for acquiring user behavior data corresponding to a model prediction result and carrying out clustering analysis on the user behavior data to obtain a clustering result;
the second data determining submodule is used for determining an abnormal user based on the clustering result and taking training data to be analyzed corresponding to the abnormal user as second abnormal training data;
and the third data determining submodule is used for combining the first abnormal training data and the second abnormal training data to obtain abnormal training data.
8. The apparatus for determining training data according to claim 7, wherein the second data determining submodule includes:
the user screening unit is used for screening out users of which the user behavior data meet a preset behavior rule based on the clustering result and taking the users as initial abnormal users;
the data determining unit is used for taking the training data to be analyzed of the initial abnormal user as fourth training data and taking the training data to be analyzed of the non-initial abnormal user as fifth training data;
a first residual calculation unit configured to calculate a third residual value of the feature value of the fifth training data;
a second residual calculation unit, configured to perform feature replacement on the fourth training data to obtain sixth training data, and calculate a fourth residual of a feature value of the sixth training data;
and the user determining unit is used for taking the initial abnormal user as an abnormal user under the condition that the third residual value and the fourth residual value meet a preset residual value rule.
9. The apparatus for determining training data according to claim 7, wherein the feature determination module comprises:
a first feature determining unit, configured to use, as a first replacement feature, the feature whose sequence number value changes the most between the first association relationship and the second association relationship;
a feature screening unit, configured to randomly screen out a preset number of features from the features other than the one whose sequence number value changes the most between the first association relationship and the second association relationship, and use them as second replacement features;
a second feature determination unit configured to use the first replacement feature and the second replacement feature as replacement features.
10. An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls the program and is arranged to perform a method of determining training data as claimed in any one of claims 1 to 5.
Application number: CN202111422756.1A
Publication number: CN114021667A (en)
Title: Method and device for determining training data and electronic equipment
Priority date: 2021-11-26
Filing date: 2021-11-26
Publication date: 2022-02-08
Status: Pending
Priority applications (1) and applications claiming priority (1): CN202111422756.1A
Family ID: 80066631
Family applications (1): CN202111422756.1A
Country status (1): CN, CN114021667A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination