CN116467602A

CN116467602A - Training data generation method, device, computer equipment and storage medium

Info

Publication number: CN116467602A
Application number: CN202310467910.XA
Authority: CN
Inventors: 张�诚; 程佩哲; 韩玮祎
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-04-27
Filing date: 2023-04-27
Publication date: 2023-07-21

Abstract

The application relates to a training data generation method, a training data generation apparatus, computer equipment and a storage medium, in the technical fields of information security and artificial intelligence. The method comprises the following steps: acquiring an initial service data set; performing supplementary processing on each piece of initial service data according to a preset supplementary strategy, and performing data cleaning on the supplemented initial service data to obtain the service data; clustering the service data to obtain service data groups, and screening out the service data groups that meet a preset clustering condition as initial target training groups; and extracting a plurality of pieces of service data from each initial target training group as initial target training data, and performing data splitting and recombination on the initial target training data to obtain target training data, wherein the target training data are used for training an artificial intelligence model. With this method, the attack data in the generated service data can be reduced.

Description

Training data generation method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of information security technologies, and in particular, to a training data generating method, apparatus, computer device, storage medium, and computer program product.

Background

With the development of the financial industry, various artificial intelligence models are needed to assist its data processing. However, the training of these models is often compromised by attack data from abnormal clients, so that the attacked models cannot run normally. How to defend against attack data is therefore a focus of current research.

The traditional way of defending against attack data is to split and recombine all the service data at the data layer, so as to reduce the influence of potential attack data on the service data set. However, this approach cannot remove attack data directly, and because a large amount of normal data is split and recombined together with a large amount of abnormal data, the data properties of the original normal data can be severely damaged. Much of the originally normal data may then become abnormal, so that the artificial intelligence model cannot be trained.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a training data generation method, apparatus, computer device, computer readable storage medium, and computer program product.

In a first aspect, the present application provides a training data generation method. The method comprises the following steps:

Acquiring an initial service data set; the initial service data set comprises a plurality of initial service data;

performing supplementary processing on each piece of initial service data according to a preset supplementary strategy, and performing data cleaning processing on the supplementary initial service data to obtain each piece of service data;

clustering the service data to obtain service data sets, and screening the service data sets meeting preset clustering conditions from the service data sets to serve as initial target training sets;

and extracting a plurality of business data from each initial target training group to serve as initial target training data, and carrying out data splitting and recombination processing on each initial target training data to obtain each target training data, wherein the target training data are used for training an artificial intelligent model.

Optionally, the performing a complementary process on each piece of initial service data according to a preset complementary policy, and performing a data cleaning process on the complementary processed initial service data to obtain each piece of service data, where the method includes:

identifying the data attribute of each initial service data and the first abnormal initial service data with blank data in each initial service data, and supplementing the blank data of each first abnormal initial service data according to each other data with the same data attribute of each first abnormal initial service data to obtain each first service data;

Identifying second abnormal data in each first service data through an abnormal data detection strategy, and deleting the second abnormal data to obtain each second service data;

and sequencing the second service data to obtain a data sequence of the initial service data, and carrying out equipartition smoothing processing on the initial service data based on the data sequence to obtain the service data.

Optionally, the data attribute of the initial service data includes a numerical attribute and a non-numerical attribute; the supplementing the blank data of each first abnormal initial service data according to the other data of the same data attribute of each first abnormal initial service data comprises the following steps:

selecting first initial business data which have the same data attribute as the first abnormal initial business data except all first abnormal initial business data in the initial business data according to each first abnormal initial business data, and selecting each piece of data to be supplemented corresponding to the vacant data in each first initial business data according to the vacant data of the first abnormal initial business data;

under the condition that the data attribute of the first abnormal initial service data is a numerical value attribute, carrying out average processing on each piece of data to be supplemented to obtain supplementary data, and supplementing the supplementary data to the first abnormal initial service data;

And under the condition that the data attribute of the first abnormal initial service data is a non-numerical attribute, the data to be supplemented with the largest amount of the same data to be supplemented is used as the supplementary data, and the supplementary data is supplemented to the first abnormal initial service data.

Optionally, the identifying, by the abnormal data detection policy, the second abnormal data in each of the first service data includes:

establishing a quartile box plot of the first service data based on each first service data, and identifying an outlier of each first service data based on the quartile box plot;

and taking the first service data corresponding to the abnormal value meeting the preset abnormal condition as second abnormal data.

Optionally, the sorting the second service data to obtain a data sequence of each initial service data, and performing average smoothing processing on each initial service data based on the data sequence to obtain each service data, including:

determining target features contained in the second service data based on the features of the second service data, and calculating feature values of the target features of the second service data;

Sequencing the second service data according to the sequence from the big to the small of the characteristic value of the second service data to obtain a data sequence of the second service data;

and dividing the data sequence into a plurality of equal-depth second service data sets through an equal-depth box division algorithm, and carrying out data smoothing processing on each equal-depth second service data set to obtain each service data.

Optionally, the clustering processing is performed on each service data to obtain each service data set, and a service data set meeting a preset clustering condition is screened from each service data set to be an initial target training set, including:

clustering the business data through a cluster analysis strategy to obtain business data groups;

and identifying an outlier service data group corresponding to the outlier service data in each service data group, and taking each service data group except each outlier service data group as an initial target training group.

Optionally, the extracting a plurality of service data from each initial target training group as initial target training data includes:

and respectively extracting a plurality of business data in each initial target training group by a random subsampling mode to serve as initial target training data.

Optionally, the extracting a plurality of service data in each initial target training group by a random subsampling method as initial target training data includes:

based on the preset sampling number, respectively extracting a plurality of service data in each initial target training group in a random subsampling mode to obtain first service data, and extracting the first service data with the preset sampling number from the first service data as initial target training data;

and in each initial target training group except the initial target training data, returning to execute the steps of respectively extracting a plurality of business data to obtain the first business data until the number of all initial target training data meets the number threshold of the initial target training data, and outputting the initial target training data.

Optionally, the performing data splitting and reorganizing processing on each initial target training data to obtain each target training data includes:

dividing each initial target training data by a data dividing algorithm to obtain each divided data;

and respectively carrying out pairwise recombination processing on each piece of segmentation data through a data recombination algorithm to obtain each piece of target training data.

In a second aspect, the present application further provides a training data generating apparatus. The device comprises:

the acquisition module is used for acquiring an initial service data set; the initial service data set comprises a plurality of initial service data;

the processing module is used for carrying out supplementary processing on each initial service data according to a preset supplementary strategy, and carrying out data cleaning processing on the supplementary initial service data to obtain each service data;

the screening module is used for carrying out clustering processing on each service data to obtain each service data set, and screening the service data sets meeting the preset clustering condition from each service data set to be an initial target training set;

and the reorganization module is used for extracting a plurality of business data from each initial target training group to serve as initial target training data, and carrying out data splitting reorganization processing on each initial target training data to obtain each target training data, wherein the target training data are used for training the artificial intelligent model.

Optionally, the processing module is specifically configured to:

Optionally, the screening module is specifically configured to:

Optionally, the recombination module is specifically configured to:

In a third aspect, the present application provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the method of any of the first aspects when the processor executes the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

In a fifth aspect, the present application provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

With the above training data generation method, apparatus, computer device, storage medium and computer program product, an initial service data set is acquired, the initial service data set comprising a plurality of initial service data; supplementary processing is performed on each piece of initial service data according to a preset supplementary strategy, and data cleaning is performed on the supplemented initial service data to obtain the service data; the service data are clustered to obtain service data groups, and the service data groups meeting a preset clustering condition are screened out as initial target training groups; and a plurality of pieces of service data are extracted from each initial target training group as initial target training data, and data splitting and recombination are performed on the initial target training data to obtain target training data, which are used for training an artificial intelligence model. In this scheme, blank data in the initial service data are first supplemented, abnormal data are deleted and the data are smoothed to obtain the service data; service data containing less attack data are then extracted as initial target training data by means of a cluster analysis strategy and random sampling; and the target training data are obtained by splitting and recombining service data from which attack data have been removed. This reduces the abnormal data in the target training data while leaving enough target training data to train the artificial intelligence model normally.

Drawings

FIG. 1 is a flow diagram of a training data generation method in one embodiment;

FIG. 2 is a flow chart illustrating steps for determining traffic data in one embodiment;

FIG. 3 is a flow diagram of an example of training data generation in one embodiment;

FIG. 4 is a block diagram of a training data generation apparatus in one embodiment;

FIG. 5 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

The training data generation method provided by the embodiments of the application can be applied to a terminal, to a server, or to a system comprising a terminal and a server, in which case it is implemented through interaction between the terminal and the server. The terminal may include, but is not limited to, personal computers, notebook computers, tablet computers, and the like. The terminal obtains the service data by supplementing blank data in the initial service data, deleting abnormal data and smoothing; it then extracts service data containing less attack data as initial target training data by means of a cluster analysis strategy and random sampling, and obtains the target training data by splitting and recombining service data from which attack data have been removed. This reduces the abnormal data in the target training data while leaving enough target training data to train the artificial intelligence model normally.

In one embodiment, as shown in FIG. 1, a training data generation method is provided. The method is described as applied to a terminal by way of example and includes the following steps:

step S101, an initial service data set is acquired.

Wherein the initial service data set includes a plurality of initial service data.

In this embodiment, in response to a training operation by a user, the terminal acquires a plurality of initial service data to be used for training and takes the set of these initial service data as the initial service data set. The initial service data may be obtained by, but not limited to, user input, transmission from other terminals, or generation by a service data generating program. The initial service data may include attack data. The initial service data are service data carrying data labels for the artificial intelligence model, the data labels being used for training the model; examples include financial service data and statistical service data.

Step S102, according to a preset supplementing strategy, carrying out supplementing processing on each initial service data, and carrying out data cleaning processing on the initial service data after the supplementing processing to obtain each service data.

In this embodiment, the terminal fills the blanks in each initial service data according to the preset supplementary strategy, then deletes abnormal data from the supplemented initial service data, and finally smooths the resulting data to obtain the service data. In this way, abnormal data in the initial service data are eliminated so that the data can be used for training a model. The data cleaning process includes, but is not limited to, filling missing data and null values, eliminating abnormal data, smoothing noisy data, correcting inconsistent data, and converting the data into a form suitable for training a machine learning model (that is, the model) through normalization or standardization. The specific data cleaning process will be described in detail later.

Step S103, clustering processing is carried out on each service data to obtain each service data set, and the service data set meeting the preset clustering condition is screened from each service data set to be an initial target training set.

In this embodiment, the terminal clusters all the service data through a cluster analysis strategy to obtain a plurality of service data groups, where the cluster analysis method applied by the cluster analysis strategy may be, but is not limited to, an unsupervised learning cluster analysis method. The terminal then selects, from the service data groups, those meeting the preset clustering condition as initial target training groups. The preset clustering condition may be, but is not limited to, a preset number ratio, that is, a preset ratio of the number of service data in a service data group to the number of all service data. The terminal takes the service data groups meeting the preset number ratio as initial target training groups.
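As an illustration only, the number-ratio condition can be sketched as follows. The sketch assumes that cluster labels have already been computed by some clustering method and that "meeting the preset number ratio" means a group's share of all records is at least the preset ratio; the function name `select_training_groups` and the parameter `min_ratio` are hypothetical and do not come from the application.

```python
from collections import Counter


def select_training_groups(labels, min_ratio):
    """Return the cluster labels whose share of all records is >= min_ratio.

    labels: one cluster label per service record (assumed precomputed).
    min_ratio: hypothetical preset number ratio, e.g. 0.2 for 20%.
    """
    total = len(labels)
    if total == 0:
        return set()
    counts = Counter(labels)
    return {label for label, n in counts.items() if n / total >= min_ratio}


# Ten records in three clusters: "a" holds 40%, "b" 50%, "c" 10%.
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "c"]
kept = select_training_groups(labels, min_ratio=0.2)
print(sorted(kept))  # groups "a" and "b" pass the ratio, "c" does not
```

Whether the comparison should be inclusive, and what ratio is appropriate, are choices the application leaves open.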

Step S104, extracting a plurality of business data from each initial target training group to serve as initial target training data, and carrying out data splitting and recombination processing on each initial target training data to obtain each target training data, wherein the target training data are used for training an artificial intelligent model.

In this embodiment, the terminal extracts a plurality of service data from each initial target training group by repeated random sampling, and uses each extracted service data as initial target training data. The specific extraction process will be described in detail later. The terminal then splits each initial target training data and recombines the split data to obtain the target training data. The artificial intelligence model may be, but is not limited to, an artificial intelligence model related to financial business, such as an intelligent reservation model, an intelligent business processing model, an intelligent data statistics model, and the like.
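The sampling and recombination steps can be pictured with the following minimal sketch. The application does not specify the data dividing algorithm or the data recombination algorithm, so everything here is an assumption made for illustration: records are modelled as tuples of fields, each record is split at its midpoint, and adjacent records exchange their second halves. The names `subsample_rounds`, `split_and_recombine`, `per_round` and `total_needed` are hypothetical.

```python
import random


def subsample_rounds(group, per_round, total_needed, seed=0):
    """Draw records without replacement, per_round at a time, until
    total_needed records are collected or the group is exhausted."""
    if per_round <= 0:
        raise ValueError("per_round must be positive")
    rng = random.Random(seed)
    remaining = list(range(len(group)))  # indices not yet sampled
    picked = []
    while remaining and len(picked) < total_needed:
        k = min(per_round, len(remaining), total_needed - len(picked))
        batch = rng.sample(remaining, k)
        picked.extend(group[i] for i in batch)
        chosen = set(batch)
        remaining = [i for i in remaining if i not in chosen]
    return picked


def split_and_recombine(records):
    """Assumed scheme: split each record at its midpoint, then swap the
    second halves of adjacent records (pairwise recombination)."""
    halves = [(r[: len(r) // 2], r[len(r) // 2:]) for r in records]
    out = []
    for i in range(0, len(halves) - 1, 2):
        (h1, t1), (h2, t2) = halves[i], halves[i + 1]
        out.append(h1 + t2)
        out.append(h2 + t1)
    if len(halves) % 2 == 1:  # an unpaired last record is kept unchanged
        h, t = halves[-1]
        out.append(h + t)
    return out


group = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
sample = subsample_rounds(group, per_round=2, total_needed=3)
print(split_and_recombine(sample))
```

Sampling without replacement across rounds mirrors the idea of returning to the remaining data until the required number of initial target training data is reached.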

Based on this scheme, blank data in the initial service data are first supplemented, abnormal data are deleted and the data are smoothed to obtain the service data; service data containing less attack data are then extracted as initial target training data by means of a cluster analysis strategy and random sampling; and the target training data are obtained by splitting and recombining service data from which attack data have been removed. This reduces the abnormal data in the target training data while leaving enough target training data to train the artificial intelligence model normally.

Optionally, as shown in fig. 2, according to a preset supplementing policy, supplementing each initial service data, and performing data cleaning processing on the supplemented initial service data to obtain each service data, where the steps include:

step S201, identifying the data attribute of each initial service data and the first abnormal initial service data with blank data in each initial service data, and supplementing the blank data of each first abnormal initial service data according to each other data with the same data attribute of each first abnormal initial service data to obtain each first service data.

In this embodiment, the terminal identifies the data attribute of each initial service data based on each initial service data, and determines whether there is blank data in each initial service data. The terminal takes the initial service data with the vacant data as first abnormal initial service data. The terminal divides each first abnormal initial service data according to the data attribute to obtain each first abnormal initial service data group, and groups other data of each non-first abnormal initial service data according to the data attribute to obtain each initial service data group. And the terminal fills the blank data of each first abnormal initial service data through other data with the same data attribute according to the data filling strategy corresponding to each data attribute, so as to obtain each first service data. The data attributes may include numeric attributes, and non-numeric attributes, among others. The specific replenishment process will be described in detail later.

Step S202, identifying second abnormal data in each first service data through an abnormal data detection strategy, and deleting the second abnormal data to obtain each second service data.

In this embodiment, the terminal screens and identifies each second abnormal data in each first service data by the abnormal data detection policy, and deletes each second abnormal data from each first service data. The specific process of identifying the second abnormal data will be described in detail later.

Step S203, the second business data are sequenced to obtain the data sequence of the initial business data, and based on the data sequence, the initial business data are equally divided and smoothed to obtain the business data.

In this embodiment, the terminal determines the feature value of each second service data and sorts the second service data in descending order of feature value to obtain the data sequence of the second service data. The terminal then divides the second service data into a plurality of data groups based on the data sequence and smooths the second service data in each data group to obtain the service data. The specific smoothing process will be described in detail later.

Based on this scheme, cleaning the initial service data yields the service data, which serve as the data basis both for the subsequent screening of attack data and for training the model.

Optionally, the data attribute of the initial service data includes a numerical attribute and a non-numerical attribute; according to each other data of the same data attribute of each first abnormal initial service data, the blank data of each first abnormal initial service data is supplemented, and the method comprises the following steps: for each first abnormal initial service data, selecting first initial service data which are the same as the data attribute of the first abnormal initial service data except all first abnormal initial service data in each initial service data, and selecting each data to be supplemented corresponding to the vacant data in each first initial service data according to the vacant data of the first abnormal initial service data; under the condition that the data attribute of the first abnormal initial business data is a numerical value attribute, carrying out average processing on each piece of data to be supplemented to obtain supplementary data, and supplementing the supplementary data to the first abnormal initial business data; and under the condition that the data attribute of the first abnormal initial service data is a non-numerical attribute, the data to be supplemented with the largest amount of the same data to be supplemented is used as the supplementary data, and the supplementary data is supplemented to the first abnormal initial service data.

In this embodiment, for each first abnormal initial service data, the terminal selects, from the initial service data that are not first abnormal initial service data, those having the same data attribute as the first abnormal initial service data. According to the position of the blank data in the first abnormal initial service data, the terminal takes the data at the same position in each of the selected initial service data as the data to be supplemented. The terminal then determines the category of the data attribute of the first abnormal initial service data. If the data attribute is a numerical attribute, the terminal averages the data to be supplemented to obtain the supplementary data, that is, the numerical value obtained by taking the mean of the numerical values. If the data attribute is a non-numerical attribute, the terminal finds the value that occurs most often among the data to be supplemented and takes that value as the supplementary data. In either case, the terminal fills the blank of the first abnormal initial service data with the supplementary data. In the same way, the terminal supplements all first abnormal initial service data.
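The mean and most-frequent-value filling described above can be sketched as follows. This is a minimal illustration, not the application's implementation: records are modelled as dictionaries, a blank is represented by `None`, and the names `impute`, `numeric_fields` and `categorical_fields` are hypothetical. As in the text, only records without any blank serve as donors.

```python
from collections import Counter
from statistics import mean


def impute(records, numeric_fields, categorical_fields):
    """Fill None values using only the records that have no blanks at all.

    Numeric fields get the donor mean; non-numeric fields get the value
    that occurs most often among the donors.
    """
    complete = [r for r in records if None not in r.values()]
    if not complete:
        return list(records)  # no donor records, leave blanks untouched
    fills = {}
    for f in numeric_fields:
        fills[f] = mean([r[f] for r in complete])
    for f in categorical_fields:
        fills[f] = Counter(r[f] for r in complete).most_common(1)[0][0]
    return [
        {k: (fills[k] if v is None else v) for k, v in r.items()}
        for r in records
    ]


records = [
    {"amount": 100.0, "channel": "web"},
    {"amount": None, "channel": "app"},
    {"amount": 300.0, "channel": "web"},
    {"amount": 200.0, "channel": None},
]
filled = impute(records, ["amount"], ["channel"])
print(filled[1]["amount"], filled[3]["channel"])  # 200.0 web
```

Because no record is dropped, the number of initial service data is preserved while the blanks are filled.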

Based on the scheme, the first abnormal initial service data is supplemented by the supplementary data in the non-first abnormal initial service data, so that the number of the initial service data is reserved to the greatest extent while the blank data of the first abnormal initial service data is filled, and the integrity of the initial service data is ensured.

Optionally, identifying the second abnormal data in each first service data through an abnormal data detection policy includes: establishing a quartile box plot of the first service data set based on each first service data, and identifying an abnormal value of each first service data based on the quartile box plot; and taking the first service data corresponding to the abnormal value meeting the preset abnormal condition as second abnormal data.

In this embodiment, the terminal establishes a quartile box plot of the first service data set through the abnormal data detection strategy, where the abnormal data detection strategy is the program that builds the quartile box plot. The terminal identifies the abnormal value of each first service data based on the quartile box plot. As shown in FIG. 3, the quartile box plot is described by an upper bound, a lower bound, an upper quartile, a mean, a median and a lower quartile, where: the lower quartile Q1 is the value at the 25% quantile; the median Q2 is the value at the 50% quantile; the upper quartile Q3 is the value at the 75% quantile; the upper bound is Q3 + 1.5 × (Q3 - Q1); and the lower bound is Q1 - 1.5 × (Q3 - Q1), where Q3 - Q1 is the interquartile range. The terminal takes the value of any data point that lies above the upper bound or below the lower bound as a discrete point value. The terminal takes the abnormal value of the first service data corresponding to a discrete point value as an abnormal value meeting the preset abnormal condition. Specifically, the terminal presets an abnormal threshold and takes the first service data whose abnormal value is lower than the abnormal threshold as second abnormal data.

Based on this scheme, abnormal data are identified through the quartile box plot, which improves the accuracy of identifying abnormal data.
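A minimal sketch of the quartile bounds, assuming one numeric value per first service data. The application does not say how the quantiles are interpolated, so the use of `statistics.quantiles` with `method="inclusive"` is an assumption, as are the names `iqr_bounds` and `split_outliers`. The additional preset abnormal threshold mentioned in the text is not modelled here.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return (lower bound, upper bound) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(values):
    """Separate values inside the bounds from those outside them."""
    lower, upper = iqr_bounds(values)
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers


values = [10, 12, 11, 13, 12, 11, 95]
kept, outliers = split_outliers(values)
print(outliers)  # [95]
```

For these values Q1 is 11 and Q3 is 12.5, so the bounds are 8.75 and 14.75 and only 95 falls outside them.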

Optionally, ordering the second service data to obtain a data sequence of each initial service data, and performing equipartition smoothing processing on each initial service data based on the data sequence to obtain each service data, including: determining target features contained in the second service data together based on the features of the second service data, and calculating feature values of the target features of the second service data; sequencing the second business data according to the sequence of the characteristic values of the second business data from big to small to obtain a data sequence of the second business data; and dividing the data sequence into multiple equal-depth second service data sets through an equal-depth box division algorithm, and carrying out data smoothing on the equal-depth second service data sets to obtain the service data.

In this embodiment, the terminal selects, as the target feature, features included in all the second service data among the features of each of the second service data. Wherein the characteristics of the traffic data may be, but are not limited to, data size, data type, data run rate, data ordinal (i.e., how many times the data is ordered in importance among all data), etc. The characteristics contained in each service data may be different, and the number of target characteristics screened by the terminal is unique. The terminal calculates the characteristic value of the target characteristic of each second service data. And the terminal sorts the second service data according to the sequence from the big characteristic value to the small characteristic value of the second service data to obtain a data sequence of the second service data. The terminal divides the data sequence into a plurality of second service data groups with equal depth through an equal depth box division algorithm. Wherein the difference value between the first second service data of each sequence in each equal-depth second service data group and the second service data of the last bit of each sequence is the same, but the second service data contained in each equal-depth second service data group is different. And the terminal performs data smoothing processing on each second service data group to obtain each service data.

For example, the sizes of the second service data are 45, 43, 41, 40, 37, 33, and 32, and the two equal-depth second service data groups obtained by equally dividing the data contain second service data of the following sizes: first group: 45, 43, 41, 40; second group: 37, 33, 32. The terminal smooths each equal-depth second service data group according to the mean value of the target feature within that group to obtain the smoothed second service data, so that the feature values of the target feature are the same for all smoothed second service data in a group. For example, if the two equal-depth second service data groups obtained by equally dividing the data contain second service data of sizes 45, 42, 41, 40 (first group) and 37, 33, 32 (second group), then after data smoothing the data sizes are: first group: 42, 42, 42, 42; second group: 34, 34, 34.
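The binning and smoothing step can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `equal_depth_bins` and `smooth_by_bin_mean`, the number of bins, and the rule that earlier bins absorb any remainder are all assumptions made for the example, and the input values are illustrative only.

```python
from statistics import mean


def equal_depth_bins(values, n_bins):
    """Sort values in descending order and split them into n_bins groups
    of (near-)equal size. Assumption: earlier bins take one extra item
    when the count is not evenly divisible."""
    ordered = sorted(values, reverse=True)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


def smooth_by_bin_mean(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


# Illustrative feature values (data sizes) of the second service data.
sizes = [45, 42, 41, 40, 37, 33, 32]
bins = equal_depth_bins(sizes, n_bins=2)
smoothed = smooth_by_bin_mean(bins)
print(bins)
print(smoothed)
```

Smoothing by the bin mean is only one option; bin medians or bin boundaries are common alternatives, and the description above names only the mean.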

Based on this scheme, smoothing each second service data eliminates noise and inconsistent data in the second service data and improves the defensive capability against attack data.

Optionally, performing clustering processing on each service data to obtain service data groups, and screening from the service data groups those that meet a preset clustering condition as initial target training groups, includes: clustering the service data through a cluster analysis strategy to obtain the service data groups; and identifying, among the service data groups, the outlier service data groups corresponding to outlier service data, and taking each service data group other than the outlier service data groups as an initial target training group.

In this embodiment, the terminal performs clustering processing on each service data through a cluster analysis strategy to obtain the service data groups. The terminal presets a threshold on the number of service data in a service data group and judges whether any service data group contains fewer service data than this threshold. If such service data groups exist, the terminal takes each service data group below the threshold as an outlier service data group, and takes each service data group other than the outlier service data groups as an initial target training group. The cluster analysis strategy may be, but is not limited to, a clustering strategy corresponding to an unsupervised-learning cluster analysis method.
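The screening rule described above (discard any cluster whose size falls below a preset threshold) can be sketched as follows. The clustering itself is left open by the text, so `cluster_by_gap` below is only a toy one-dimensional stand-in chosen to keep the example self-contained; it, the `max_gap` parameter, `drop_outlier_groups`, and the sample values are assumptions for illustration.

```python
def cluster_by_gap(values, max_gap):
    """Toy 1-D clustering (a stand-in for any unsupervised method):
    after sorting, a value joins the current group when it lies within
    max_gap of the group's last value, otherwise it starts a new group."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


def drop_outlier_groups(groups, min_group_size):
    """Treat groups with fewer than min_group_size members as outlier
    groups; the remaining groups are the initial target training groups."""
    kept = [g for g in groups if len(g) >= min_group_size]
    outliers = [g for g in groups if len(g) < min_group_size]
    return kept, outliers


# Illustrative feature values; 200 sits far away from the two dense groups.
values = [10, 11, 12, 13, 50, 51, 52, 53, 54, 200]
groups = cluster_by_gap(values, max_gap=5)
kept, outliers = drop_outlier_groups(groups, min_group_size=3)
print(kept)
print(outliers)
```

In practice the grouping would come from whichever unsupervised clustering method is chosen; only the size-threshold filter is taken from the description.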

Based on this scheme, outlier service data are removed after clustering, which guards against data poisoning attacks that inject only a small amount of poisoned data, such as boiling-frog attacks and label-flipping attacks, and thereby improves the defensive capability against attack data.

Optionally, extracting a plurality of service data from each initial target training group as initial target training data includes: and respectively extracting a plurality of business data in each initial target training group by a random subsampling mode to serve as initial target training data.

In this embodiment, the terminal extracts a plurality of service data from each initial target training set by means of random subsampling, and uses all the extracted service data as each initial target training data. The specific extraction process will be described in detail later.

Based on the scheme, attack data possibly existing in the initial target training data can be effectively removed through a random subsampling mode, and the training data generation efficiency is improved.

Optionally, extracting a plurality of service data from each initial target training group by random subsampling as initial target training data includes: based on a preset sampling number, extracting a plurality of service data from each initial target training group by random subsampling to obtain first service data, and extracting the preset sampling number of first service data from the first service data as initial target training data; and, for the service data remaining in each initial target training group after the initial target training data have been removed, returning to the step of extracting a plurality of service data to obtain the first service data, until the total number of initial target training data meets the number threshold for initial target training data, and then outputting the initial target training data.

In this embodiment, the terminal presets the sampling number and the number threshold, extracts a plurality of service data from each initial target training group by random subsampling, and takes the extracted service data as first service data. The terminal then extracts the preset sampling number of first service data from the first service data as initial target training data. Next, for the service data remaining in each initial target training group after the initial target training data have been removed, the terminal returns to the step of extracting a plurality of service data to obtain the first service data and extracts service data again. When the total number of initial target training data accumulated over the repeated extractions reaches the number threshold, the terminal stops iterating and outputs the initial target training data.
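The iterative extraction can be sketched as follows, under stated assumptions. The text does not say how many service data are drawn from each group per round, so the `per_group` parameter is hypothetical, as are the function name `iterative_subsample`, the fixed `seed`, and the choice to cap the final round so that the output does not exceed the number threshold.

```python
import random


def iterative_subsample(groups, per_group, sample_count, total_threshold, seed=0):
    """Repeat: draw candidates ("first service data") from every group,
    keep sample_count of them, remove the kept items from their groups,
    and stop once total_threshold items have been collected."""
    rng = random.Random(seed)
    remaining = [list(g) for g in groups]  # work on copies of the groups
    selected = []
    while len(selected) < total_threshold:
        # Step 1: random subsample from each group (the first service data).
        candidates = []
        for gi, group in enumerate(remaining):
            k = min(per_group, len(group))
            candidates.extend((gi, item) for item in rng.sample(group, k))
        if not candidates:
            break  # every group is exhausted
        # Step 2: keep the preset sampling number of candidates.
        need = min(sample_count, total_threshold - len(selected), len(candidates))
        for gi, item in rng.sample(candidates, need):
            remaining[gi].remove(item)  # exclude it from later rounds
            selected.append(item)
    return selected


# Two illustrative initial target training groups.
groups = [list(range(0, 10)), list(range(100, 110))]
selected = iterative_subsample(groups, per_group=3, sample_count=4, total_threshold=10)
print(selected)
```

Removing the kept items from their groups before the next round mirrors the requirement that later rounds draw only from data other than the initial target training data already selected.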

Based on this scheme, the initial target training data in each service data group are extracted by random subsampling. This defends against poisoning-data insertion attacks, in which an attacker adds a limited number of poisoned data with arbitrary feature vectors, and against attacks that modify the feature vectors or the labels of an arbitrary subset of the service data set, thereby improving the defensive capability against attack data.

Optionally, performing data splitting and recombination processing on each initial target training data to obtain each target training data includes: segmenting each initial target training data through a data segmentation algorithm to obtain segmented data; and recombining the segmented data in pairs through a data recombination algorithm to obtain each target training data.

In this embodiment, the terminal segments each initial target training data through a data segmentation algorithm to obtain segmented data. The data segmentation algorithm may be any data segmentation algorithm capable of performing this segmentation. The terminal then recombines the segmented data in pairs through a data recombination algorithm to obtain each target training data. The data recombination algorithm may be any data recombination algorithm capable of performing this pairwise recombination.
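Because the text leaves both algorithms open, the following is only one possible reading, offered as a sketch: each record is cut at a fixed position, and consecutive records are paired so that each swaps its tail with its partner. The function names, the `cut` parameter, the pairing of neighbours, and the handling of an unpaired last record are all assumptions and are not taken from the application.

```python
def split_record(record, cut):
    """Split one record (a tuple of fields) into a head and a tail."""
    return record[:cut], record[cut:]


def recombine_pairwise(records, cut):
    """Pair consecutive records and swap their tails.
    Assumption: an unpaired final record is passed through unchanged."""
    out = []
    for i in range(0, len(records) - 1, 2):
        a_head, a_tail = split_record(records[i], cut)
        b_head, b_tail = split_record(records[i + 1], cut)
        out.append(a_head + b_tail)
        out.append(b_head + a_tail)
    if len(records) % 2 == 1:
        out.append(records[-1])
    return out


# Illustrative initial target training data, four fields per record.
records = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
recombined = recombine_pairwise(records, cut=2)
print(recombined)
```

Whether swapping fields between records preserves label consistency depends on the data, so a real pipeline would need to choose the cut and the pairing with that in mind.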

Based on this scheme, data splitting and recombination are performed on initial target training data from which a large amount of attack data has already been removed. This avoids splitting and recombining large amounts of normal data together with large amounts of attack data, improves the usability of the resulting target training data, and preserves the integrity of the target training data to the greatest extent.

The application also provides a training data generation example, as shown in fig. 3, and the specific processing procedure includes the following steps:

Step S301, an initial service data set is acquired.

Step S302, for each first abnormal initial service data, selecting, from the initial service data other than all first abnormal initial service data, the first initial service data having the same data attribute as the first abnormal initial service data, and selecting from the first initial service data, according to the vacant data of the first abnormal initial service data, the data to be supplemented corresponding to the vacant data.

Step S303, in the case that the data attribute of the first abnormal initial service data is a numerical attribute, averaging the data to be supplemented to obtain the supplementary data, and supplementing the supplementary data to the first abnormal initial service data to obtain the first service data.

Step S304, in the case that the data attribute of the first abnormal initial service data is a non-numerical attribute, taking the value that occurs most frequently among the data to be supplemented as the supplementary data, and supplementing the supplementary data to the first abnormal initial service data to obtain the first service data.

Step S305, building a quartile box plot of the first service data set based on each first service data, and identifying the abnormal value of each first service data based on the quartile box plot.

Step S306, the first service data corresponding to the abnormal value meeting the preset abnormal condition is used as the second abnormal data, and the second abnormal data is deleted to obtain each second service data.

Step S307, based on the data attribute of each second service data, determining the target feature shared by all of the second service data, and calculating the feature value of the target feature of each second service data.

Step S308, sorting the second service data in descending order of feature value to obtain the data sequence of the second service data.

Step S309, dividing the data sequence into a plurality of equal-depth second service data groups through an equal-depth binning algorithm, and performing data smoothing on each equal-depth second service data group to obtain each service data.

Step S310, clustering processing is carried out on each service data through a clustering analysis strategy, so as to obtain each service data group.

Step S311, identifying an outlier service data set corresponding to the outlier service data in each service data set, and taking each service data set except each outlier service data set as an initial target training set.

Step S312, based on the preset sampling number, extracting a plurality of service data in each initial target training group by a random subsampling mode to obtain each first service data, and extracting the first service data with the preset sampling number from each first service data as initial target training data.

Step S313, for the service data remaining in each initial target training group after the initial target training data have been removed, returning to the step of extracting a plurality of service data to obtain each first service data, until the total number of initial target training data meets the number threshold for initial target training data, and then outputting the initial target training data.

Step S314, segmenting each initial target training data through a data segmentation algorithm to obtain each segmented data.

Step S315, performing pairwise recombination processing on each piece of divided data through a data recombination algorithm to obtain each piece of target training data.
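The supplementing and outlier-removal steps at the start of this procedure (steps S302 to S306) can be sketched as follows. This is a minimal illustration under assumptions: missing values are represented as `None`, the "preset abnormal condition" is taken to be the conventional 1.5 × IQR box-plot rule, and the function names `fill_missing` and `iqr_outliers` are hypothetical.

```python
from statistics import mean, mode, quantiles


def fill_missing(column):
    """Fill None entries of one attribute column: the mean of the other
    values for a numerical attribute, the most frequent value otherwise."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = mode(present)
    return [fill if v is None else v for v in column]


def iqr_outliers(values, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR].
    Assumption: k = 1.5 stands in for the preset abnormal condition."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


numeric_filled = fill_missing([10, None, 20, 30])
category_filled = fill_missing(["a", None, "b", "a"])
abnormal = iqr_outliers([10, 11, 12, 13, 14, 15, 100])
cleaned = [v for v in [10, 11, 12, 13, 14, 15, 100] if v not in abnormal]
print(numeric_filled, category_filled, abnormal, cleaned)
```

The quartiles here come from `statistics.quantiles` with its default method; other quartile conventions give slightly different fences, so the exact cut-off is a design choice rather than something fixed by the steps above.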

It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a training data generating device for realizing the training data generating method. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitation in the embodiments of the training data generating apparatus provided below may refer to the limitation of the training data generating method hereinabove, and will not be described herein.

In one embodiment, as shown in fig. 4, there is provided a training data generating apparatus including: an acquisition module 410, a processing module 420, a screening module 430, and a reorganization module 440, wherein:

an acquisition module 410, configured to acquire an initial service data set; the initial service data set comprises a plurality of initial service data;

the processing module 420 is configured to perform a complementary process on each of the initial service data according to a preset complementary policy, and perform a data cleaning process on the initial service data after the complementary process, so as to obtain each service data;

the screening module 430 is configured to perform clustering processing on each service data to obtain each service data set, and screen, from each service data set, a service data set that meets a preset clustering condition as an initial target training set;

the reorganization module 440 is configured to extract a plurality of service data from each initial target training group as initial target training data, and perform data splitting and recombination processing on each initial target training data to obtain each target training data, where the target training data is used for training an artificial intelligence model.

Optionally, the processing module 420 is specifically configured to:

Optionally, the screening module 430 is specifically configured to:

Optionally, the reorganization module 440 is specifically configured to:

The respective modules in the training data generation apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a training data generation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In an embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method of any of the first aspects when the computer program is executed.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of any of the first aspects.

In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may carry out the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A method of generating training data, the method comprising:

2. The method of claim 1, wherein the performing the supplemental processing on each of the initial service data according to the preset supplemental policy, and performing the data cleaning processing on the supplemental processed initial service data, to obtain each service data, includes:

3. The method of claim 2, wherein the data attributes of the initial business data include numeric attributes and non-numeric attributes; the supplementing the blank data of each first abnormal initial service data according to the other data of the same data attribute of each first abnormal initial service data comprises the following steps:

4. The method of claim 2, wherein the identifying the second anomaly data in each of the first traffic data via an anomaly data detection policy comprises:

5. The method of claim 2, wherein the sorting the second service data to obtain a data sequence of each initial service data, and performing average smoothing on each initial service data based on the data sequence to obtain each service data comprises:

6. The method of claim 1, wherein clustering each service data to obtain each service data set, and selecting a service data set satisfying a preset clustering condition from each service data set as an initial target training set, includes:

7. The method of claim 1, wherein extracting a plurality of service data from each of the initial target training sets as initial target training data comprises:

8. The method of claim 7, wherein the extracting the plurality of service data in each of the initial target training groups by the random subsampling method as the initial target training data includes:

9. The method of claim 1, wherein the performing the data splitting and reassembling process on each initial target training data to obtain each target training data includes:

10. A training data generation apparatus, the apparatus comprising:

11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when the computer program is executed.

12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 9.

13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 9.