CN110647995A - Rule training method, device, equipment and storage medium - Google Patents

Rule training method, device, equipment and storage medium Download PDF

Info

Publication number
CN110647995A
CN110647995A
Authority
CN
China
Prior art keywords
rule
data
model
training
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910705620.8A
Other languages
Chinese (zh)
Inventor
陈娴娴
阮晓雯
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910705620.8A priority Critical patent/CN110647995A/en
Priority to PCT/CN2019/117837 priority patent/WO2021017293A1/en
Publication of CN110647995A publication Critical patent/CN110647995A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data


Abstract

The invention relates to the technical field of big data processing and discloses a rule training method. The method obtains a data sample set corresponding to the algorithm scene under which the current data system processes data, repeatedly extracts model training samples and model verification samples from the data sample set with a small-batch extraction algorithm, trains data rules on these two kinds of samples with the selected rule training algorithm to obtain rule extension models, evaluates and verifies the rule extension models, and, combined with a greedy algorithm, selects the model with the highest extensibility as the rule verification module. The invention also discloses a rule training device, equipment, and a storage medium. The scheme optimizes the rules, obtains a final rule verification module for the data, improves the extensibility of the analyzed data rules, and improves the verification correctness of the final data.

Description

Rule training method, device, equipment and storage medium
Technical Field
The present invention relates to the field of big data processing technologies, and in particular, to a rule training method, apparatus, device, and computer-readable storage medium.
Background
At present, classifying and verifying data through big-data classification and regression models yields final conclusions of low accuracy, and correct prediction of data classes is poor. Existing implementations also judge the continuity and effectiveness of rules badly. In particular, when the data reaches the ten-million scale, small and medium-sized enterprises that are short of computing resources such as servers either cannot support map-reduce over big data at all, or the overall process becomes very slow and cannot meet real-time requirements, so the mining and verification of some important rules is limited. The result is rough, over-simplified deduction, and analysis and modeling mechanisms find it difficult to build models and verify results.
Disclosure of Invention
The invention mainly aims to provide a rule training method, device, equipment, and computer-readable storage medium, and aims to solve the technical problem that existing data rule analysis is too simplistic, which makes rule analysis inaccurate and harms the extensibility of the analyzed rules.
In order to achieve the above object, the present invention provides a rule training method, including the steps of:
determining a current algorithm scene, and acquiring data in a current data system based on the algorithm scene to obtain a data sample set;
repeatedly randomly extracting sub-data from the data sample set according to a preset small-batch extraction algorithm, and generating a sample group based on the sub-data, wherein the sample group comprises at least one model training sample group and at least one model verification sample group;
selecting a rule training algorithm corresponding to an algorithm scene according to a corresponding relation between a preset algorithm scene and the rule training algorithm, and performing rule model training according to the rule training algorithm and the model training sample set so as to extract a change rule of sample data from the model training sample set and generate a corresponding rule extension model;
according to the model verification sample group, performing rule extensibility evaluation and verification on each rule extension model to obtain a verification result, and ranking each rule extension model according to the verification result;
and selecting a rule extension model meeting the extension condition as the final rule verification model according to the ranking result and a preset greedy algorithm, wherein the rule verification model is used for analyzing data in the data system, and the greedy algorithm is used for performing verification deduction on the rule extension model.
Optionally, the determining a current algorithm scenario and obtaining data in a current data system based on the algorithm scenario to obtain a data sample set include:
acquiring various algorithms under a current scene, and determining the input data type of the algorithms based on the algorithms;
based on the data type, selecting a data source meeting the data type;
and reading a data set corresponding to the data type according to the data source, and extracting a small data set from the data set in a cyclic random extraction mode to form the data sample set, wherein the small data set at least comprises two kinds of data with different attributes.
Optionally, after the small data set is extracted from the data set by cyclic random extraction to form the data sample set, the method further includes:
if the data sample set is a multi-dimensional portrait data set, analyzing related information of each data in a small data set extracted from the data set, wherein the related information comprises attributes of the data;
and establishing a multi-dimensional data portrait by taking the attribute of the data as a coordinate label, and taking the data portrait as the data sample set.
Optionally, the repeatedly and randomly extracting sub-data from the data sample set according to a preset small-batch extraction algorithm, and generating a sample group based on the sub-data includes:
setting the number of the sample groups according to the size of the data sample set;
setting the number of subdata in the sample group according to different requirements of the algorithm scene on input data, wherein the requirements comprise time length;
and extracting subdata from the data sample set according to the number of the sample groups, the number of the subdata in the sample groups and the small-batch extraction algorithm to respectively form the model training sample group and the model verification sample group, wherein the small-batch extraction algorithm comprises a simple random sampling method, a hierarchical random sampling method and a clustering random sampling method.
Optionally, the selecting, according to a preset correspondence between an algorithm scene and a rule training algorithm, a rule training algorithm corresponding to the algorithm scene, and performing rule model training according to the rule training algorithm and the model training sample set, so as to extract a change rule of sample data from the model training sample set and generate a corresponding rule extension model includes:
selecting a first rule training algorithm and a second rule training algorithm according to the algorithm scene;
and according to the first rule training algorithm and the second rule training algorithm, respectively taking the subdata in the model training sample set as the input of the algorithms, and training rule models to obtain N first rule extension models and M second rule extension models, wherein the values of N and M are less than or equal to the total number of subdata in the model training sample set.
Optionally, the training of the rule model according to the first rule training algorithm and the second rule training algorithm by using the subdata in the model training sample set as the input of the algorithms respectively includes:
randomly dividing the model training sample group into K packets by adopting a cross validation method, wherein K is a positive integer larger than M;
randomly selecting one of the K packets as a test set, and using the rest K-1 packets as a training set;
and respectively carrying out model training by adopting the first model training algorithm and the second model training algorithm according to the K-1 training sets, and verifying by using a test set to obtain M first rule extension models and M second rule extension models.
Optionally, the performing, according to the model verification sample group, rule extensibility evaluation and verification on each rule extension model to obtain a verification result, and ranking each rule extension model according to the verification result includes:
inputting the model verification sample group as input information into the first rule extension model and the second rule extension model in a one-to-one correspondence manner, and outputting a prediction result for each subdata of the model verification sample group;
respectively scoring the prediction results through a preset scoring model, and ranking the first rule extension model and the second rule extension model in descending order of score to obtain a scoring matrix of the models;
summing the rows or the columns of the scoring matrix to obtain the final scoring results of the first rule extension model and the second rule extension model respectively;
the step of selecting a rule extension model meeting the extension condition as the final rule verification model according to the ranking result and a preset greedy algorithm includes:
selecting the top-n ranked rule extension models as final rule verification models according to the final scoring results of the first rule extension model and the second rule extension model, wherein n is greater than or equal to 1.
In addition, to achieve the above object, the present invention provides a rule training device, including:
the acquisition module is used for determining a current algorithm scene, acquiring data in a current data system based on the algorithm scene and obtaining a data sample set;
the extraction module is used for repeatedly and randomly extracting the subdata from the data sample set according to a preset small-batch extraction algorithm and generating a sample group based on the subdata, wherein the sample group comprises at least one model training sample group and at least one model verification sample group;
the training module is used for selecting a rule training algorithm corresponding to an algorithm scene according to the corresponding relation between a preset algorithm scene and the rule training algorithm, and performing rule model training according to the rule training algorithm and the model training sample set so as to extract a change rule of sample data from the model training sample set and generate a corresponding rule extension model;
the verification module is used for performing rule extensibility evaluation and verification on each rule extension model according to the model verification sample group to obtain a verification result, and ranking each rule extension model according to the verification result;
and the determining module is used for selecting a rule extension model meeting the extension condition as the final rule verification model according to the ranking result and a preset greedy algorithm, wherein the rule verification model is used for analyzing data in the data system, and the greedy algorithm is used for performing verification deduction on the rule extension model.
Optionally, the acquisition module is configured to acquire various algorithms in a current scene, and determine the input data type based on the algorithms; based on the data type, selecting a data source meeting the data type; and reading a data set corresponding to the data type according to the data source, and extracting a small data set from the data set in a cyclic random extraction mode to form the data sample set, wherein the small data set at least comprises two kinds of data with different attributes.
Optionally, the acquisition module is further configured to, if the data sample set is a multidimensional portrait data set, analyze relevant information of each data in a small data set extracted from the data set, where the relevant information includes an attribute of the data; and establishing a multi-dimensional data portrait by taking the attribute of the data as a coordinate label, and taking the data portrait as the data sample set.
Optionally, the extraction module includes a setting subunit and a sample generating subunit;
the setting subunit is configured to set the number of the sample groups according to the size of the data sample set; setting the number of subdata in the sample group according to different requirements of the algorithm scene on input data, wherein the requirements comprise time length;
the sample subunit is configured to extract sub-data from the data sample set according to the number of the sample groups, the number of the sub-data in the sample groups, and the small-batch extraction algorithm, to form the model training sample group and the model verification sample group, respectively, where the small-batch extraction algorithm includes a simple random sampling method, a hierarchical random sampling method, and a clustered random sampling method.
Optionally, the training module includes a selection subunit and a model generation subunit;
the selection subunit is used for selecting a first rule training algorithm and a second rule training algorithm according to the algorithm scene;
and the model generation subunit is configured to perform rule model training according to the first rule training algorithm and the second rule training algorithm by respectively using the subdata in the model training sample set as input of the algorithms, to obtain N first rule extension models and M second rule extension models, where the values of N and M are less than or equal to the total number of subdata in the model training sample set.
Optionally, the model generation subunit is configured to randomly divide the model training sample group into K packets by using a cross validation method, where K is a positive integer greater than M; randomly selecting one of the K packets as a test set, and using the rest K-1 packets as a training set; and respectively carrying out model training by adopting the first model training algorithm and the second model training algorithm according to the K-1 training sets, and verifying by using a test set to obtain M first rule extension models and M second rule extension models.
Optionally, the verification module includes a prediction subunit, a scoring subunit, and a calculation subunit;
the prediction subunit is configured to use the model verification sample group as input information, input the input information into the first rule extension model and the second rule extension model in a one-to-one correspondence manner, and output a prediction result of each sub-data of the model verification sample group;
the scoring subunit is used for scoring the prediction results through a preset scoring model, and ranking the first rule extension model and the second rule extension model in descending order of score to obtain a scoring matrix of the models;
the calculation subunit is configured to sum the rows or the columns of the scoring matrix to obtain the final scoring results of the first rule extension model and the second rule extension model respectively;
the determining module is used for selecting the top-n ranked models as final rule verification models according to the final scoring results of the first rule extension model and the second rule extension model, wherein n is greater than or equal to 1.
In addition, to achieve the above object, the present invention also provides a rule training apparatus including: a memory, a processor, and a rule training program stored on the memory and executable on the processor, the rule training program when executed by the processor implementing the steps of the rule training method as recited in any of the above.
Furthermore, to achieve the above object, the present invention provides a computer readable storage medium having a rule training program stored thereon, the rule training program implementing the steps of the rule training method according to any one of the above items when executed by a processor.
The method obtains a data sample set corresponding to the algorithm scene under which the current data system processes data, repeatedly extracts model training samples and model verification samples from the data sample set with a small-batch extraction algorithm, trains data rules on these two kinds of samples with the selected rule training algorithm to obtain rule extension models, evaluates and verifies the rule extension models, and, combined with a greedy algorithm, selects the model with the highest extensibility as the rule verification module. In this way the mass data is divided into many subsets, deduction analysis is performed on the verification rules of the subsets with certain algorithms (such as the greedy algorithm), the commonality of the verification rules across subsets is obtained, and the continuity level of the verification rules is calculated. This optimizes the rules, yields the final rule verification module for the data, further improves the extensibility of the analyzed data rules, and improves the verification accuracy of the final data.
Drawings
FIG. 1 is a schematic flow chart of a rule training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a rule training method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a server according to an embodiment of the present invention;
fig. 4 is a schematic diagram of functional modules of an embodiment of a rule training apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a rule training method, mainly a method for verifying the extensibility of the data change rules obtained when a data processing device analyzes and processes big data.
The mechanism uses a batch-size algorithm to repeatedly and randomly extract subsets of the full data from a mass-data verification set for verification. By extracting small data sets through random sampling, it continuously converts a large data set into small data sets for verification, which reduces verification complexity and improves efficiency.
In this embodiment, the physical system implementing the method may be a personal computer (PC), a smart phone, or another device with a data processing function, especially a database device or a server, as such devices often analyze and process big data. Based on the hardware of such a data storage device, various embodiments of the rule training method are provided.
Referring to fig. 1, fig. 1 is a flowchart of a rule training method according to an embodiment of the present invention. In this embodiment, the rule training method specifically includes the following steps:
s10, determining a current algorithm scene, and acquiring data in a current data system based on the algorithm scene to obtain a data sample set;
in this step, the data sample set may be collected directly according to the user's requirements, or obtained from the input data of various algorithms. When collecting samples, they may be collected randomly according to a set sample size; different sample sizes may be set in different scenes, specifically according to the size of the entire data volume.
In practical application, the data sample set can be embodied as a multidimensional portrait. A multidimensional portrait reflects multiple considerations, using the different considerations as dimensionless coordinates to create the data sample set. Further, the algorithm scene may be understood as a type of data: in practice, different data may be processed by different data processing algorithms, and different algorithms follow different analysis rules. Determining the algorithm scene first yields the corresponding data sample set and at the same time determines the subsequent data rule analysis mode, for example which model is selected to analyze the data, so as to obtain the corresponding data rule model.
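As an illustrative sketch of the scene-driven collection described above (the SCENE_CONFIG mapping, the record layout, and all names below are hypothetical assumptions, not the patent's concrete design):

```python
import random

# Hypothetical mapping from an algorithm scene to the rule-training
# algorithms it implies and the input data types those algorithms need.
SCENE_CONFIG = {
    "flu_prediction": {
        "algorithms": ["xgboost", "lightgbm"],
        "data_types": ["ili_history", "weather", "public_opinion"],
    },
}

def collect_sample_set(scene, data_system, sample_size, seed=0):
    """Randomly draw up to `sample_size` records whose type the scene needs."""
    rng = random.Random(seed)
    wanted = set(SCENE_CONFIG[scene]["data_types"])
    pool = [r for r in data_system if r["type"] in wanted]
    return rng.sample(pool, min(sample_size, len(pool)))
```

Determining the scene first thus fixes both which records are drawn and which training algorithms run afterwards.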
Step S20, repeatedly and randomly extracting sub-data from the data sample set according to a preset small batch extraction algorithm, and generating a sample group based on the sub-data, wherein the sample group comprises at least one model training sample group and at least one model verification sample group;
in this embodiment, when sample groups are generated, how they are divided differs with the sample size. Reasonably dividing the same data sample set into groups that serve both model training and verification reduces the number of times sample data must be obtained, reduces the number of data processing passes, and makes effective use of the data.
Of course, in practical application, if the two kinds of sample groups cannot both be obtained from the same data sample set, the problem of insufficient data when the sample is too small can be solved by repeated collection and cyclic reuse.
Preferably, the batch-size algorithm is selected to repeatedly and randomly extract data subsets from the verification set. The batch size serves as the sample extraction standard: a model training sample is formed by extracting batch-size sample data from the data sample set, and a model verification sample can likewise be extracted according to the batch size.
Furthermore, the number of samples drawn each time is limited by the size of the whole data set. To improve training and verification efficiency, and because the training algorithm also takes its input as small data sets, the number of samples is generally controlled at 500 to 1000 per verification group, and 10 to 100 groups are selected for verification according to different requirements and time limits.
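The group sizing described above (e.g. 500 samples per group, 10 groups) can be sketched as a small routine; the function name and default values are illustrative assumptions:

```python
import random

def make_sample_groups(sample_set, n_groups=10, group_size=500, seed=0):
    """Repeatedly draw small random subsets (mini-batches) from the data
    sample set; each subset becomes one training or verification group."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        k = min(group_size, len(sample_set))
        groups.append(rng.sample(sample_set, k))  # sampling without replacement within a group
    return groups
```

Because each group is drawn independently, the same routine can produce both the model training groups and the model verification groups.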
For the sampling process, a random sampling survey method is preferred: simple random sampling, stratified random sampling, or cluster random sampling is selected according to the requirements of the different algorithm scenes. For example, when the label data has an explicit feature hierarchy, we keep the sampling probability of each stratum the same during extraction; cluster sampling is handled in the same way.
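A minimal sketch of the stratified variant, in which every stratum keeps the same sampling probability (the `key`/`fraction` interface is an illustrative assumption):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum, so each hierarchy
    level keeps an equal sampling probability."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)       # group records by stratum label
    out = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        out.extend(rng.sample(members, k))
    return out
```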
Step S30, selecting a rule training algorithm corresponding to an algorithm scene according to the corresponding relation between a preset algorithm scene and the rule training algorithm, and performing rule model training according to the rule training algorithm and the model training sample set so as to extract the change rule of sample data from the model training sample set and generate a corresponding rule extension model;
in this embodiment, preferably, at least two model training algorithms are selected at the same time. Each sample group contains several small sample sets, and each small sample set is trained with the at least two model training algorithms to obtain at least two models, so that after training on a sample group is completed there are several models of each kind.
Step S40, performing rule extensibility evaluation and verification on each rule extension model according to the model verification sample group to obtain a verification result, and ranking each rule extension model according to the verification result;
In this embodiment, the models and sample data are continuously scored through a greedy algorithm, the optimal results are screened out, and verification and evaluation are then performed, so that the model ranking is accurate.
In this embodiment, the selection of two model algorithms, preferably an XGBoost model and a LightGBM model, is taken as an example. After several XGBoost and LightGBM models are created through the training of step S30, the model verification samples are fed as input data to the two kinds of models, which output corresponding model values. In a prediction scene the model value is the predicted value of some datum or quantity; the prediction is then scored, and each model receives a corresponding rank.
In this embodiment, the greedy algorithm may be applied to the model verification sample groups, or to the verified models. For example, a small sample set is selected from the sample groups to verify the models, the earlier-ranked part of the models is retained, and after the sample groups are verified and the scores are found to meet the standard, the better models are output.
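The scoring-matrix ranking around step S40 could be sketched as follows; the callable models, the `score_fn` interface, and the return shape are illustrative assumptions rather than the patent's concrete design:

```python
def score_matrix_ranking(models, verify_groups, score_fn):
    """Score every model on every verification group, sum each model's
    row of the scoring matrix, and rank model indices by total score."""
    matrix = [[score_fn(m, g) for g in verify_groups] for m in models]
    totals = [sum(row) for row in matrix]
    order = sorted(range(len(models)), key=lambda i: totals[i], reverse=True)
    return matrix, totals, order
```

For instance, with `score_fn` returning the negative mean absolute error of a model's predictions on a group, `order[0]` is the index of the best-fitting model.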
Step S50, selecting a rule extension model meeting the extension condition as the final rule verification model according to the ranking result and a preset greedy algorithm, wherein the rule verification model is used for analyzing data in the data system, and the greedy algorithm is used for performing verification deduction on the rule extension model.
In this step, models are generally selected according to actual use. We need not select only LightGBM and XGBoost; in a typical industrial scenario we may select tens of models. During verification, to speed things up, we may keep only the optimal model in the ranking each time, or expand the record to the 4 top-ranked models each time, and finally verify against all the sample groups to obtain the optimal model.
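The selection just described, keeping the top 4 (or top `beam`) ranked models after each sample group, can be sketched as a beam-style greedy pass; the interface below is an illustrative assumption:

```python
def greedy_select(models, sample_groups, score_fn, beam=4):
    """Greedy verification: after each sample group, keep only the
    `beam` best-scoring models, then verify the survivors on the next
    group; the last survivor's head is the final optimal model."""
    survivors = list(models)
    for group in sample_groups:
        survivors.sort(key=lambda m: score_fn(m, group), reverse=True)
        survivors = survivors[:beam]   # prune to the top-ranked models
    return survivors[0]
```

With `beam=1` this keeps only the optimal model each round, matching the faster variant mentioned above.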
In this way, continuous verification of big-data rules is realized, and a massive data set can be quickly decomposed. Second, based on the greedy algorithm, rule verification and search are carried out simply and quickly through a fast local-optimum search mechanism. Finally, the path is searched based on the greedy algorithm and the rule continuity rating is determined according to the proportion of the data volume, which greatly improves verification accuracy.
In this embodiment, for the determining of the current algorithm scenario, and obtaining data in the current data system based on the algorithm scenario, obtaining the data sample set may specifically be implemented in the following manner:
acquiring various algorithms under a current scene, and determining the input data type of the algorithms based on the algorithms;
based on the data type, selecting a data source meeting the data type;
and reading a data set corresponding to the data type according to the data source, and extracting a small data set from the data set in a cyclic random extraction mode to form the data sample set, wherein the small data set at least comprises two kinds of data with different attributes.
In practical application, data of various attributes can exist simultaneously in the same data system, and each kind of data has a corresponding algorithm that analyzes it to finally obtain the corresponding data rule. When obtaining data for analysis, the input data type can be determined from the various algorithms in the current scene; each source of the data is determined according to the data type; the data set corresponding to the data type is read from each data source, and the required data is extracted from it through a sample extraction rule to form the data sample group, where the extracted data includes data of at least two different attributes.
That is, the input data of the algorithms corresponding to different application scenarios is specified; once the input is determined, the type of the big data can actually be determined, and based on that data type, certain data systems can be selected to acquire the data.
In this embodiment, a plurality of presentation implementations of the data sample set may be performed, and when the data sample set in this embodiment is embodied in the form of a multidimensional image data set, after extracting a small data set from the data set by cyclic random extraction to form the data sample set, the method further includes:
analyzing relevant information of each data in a small data set extracted from the data set, wherein the relevant information comprises data attributes;
and establishing a multi-dimensional data portrait by taking the attribute of the data as a coordinate label, and taking the data portrait as the data sample set.
In practical application, suppose the algorithm scene is a prediction scene that predicts the influenza ILI index: a prediction model is verified and the weekly number of patients seeking treatment is predicted with it. When obtaining sample data, historical illness data and data such as public opinion or network news that reflect illness statistics are obtained. Specifically, through cooperation with governments and disease control centers, influenza ILI index data for the historical M years are obtained and, through some simple derivation algorithms, yield X0 portrait dimensions. By purchasing data from the weather bureau, weather data for the historical M years or even more are obtained, including portrait information of weather dimensions such as day and night temperature, air pressure, precipitation, air index, and humidity, yielding X1 dimensions. Meanwhile, a crawler crawls the indexes of various popular websites, mainly public opinion indexes such as sneezing, nasal congestion, headache, and fever, yielding X2 public-opinion portrait dimensions. In each scene, such big data can be collected through cooperation or some technical means. In this scenario, the requirement is the prediction of the future weekly flu ILI index, and the portrait dimensions cover the ILI history, weather data, and public opinion data of every week of the historical M years. We therefore finally obtain a (52*M) x γ*(X0+X1+X2+…) portrait, where the transverse direction is (52*M), i.e., (52*M) weekly cycles, the longitudinal direction is γ*(X0+X1+X2+…), and γ is the dimensionality-reduction factor derived by some feature engineering algorithms.
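Assembling the weekly portrait described above can be sketched as follows; the list-of-lists layout and block names are illustrative assumptions (a real system would likely use a numeric matrix library):

```python
def build_portrait(weeks, feature_blocks):
    """Assemble a (weeks) x (X0+X1+X2+...) portrait: one row per
    historical week, with columns concatenated from each feature block
    (e.g. ILI history, weather dimensions, public-opinion dimensions)."""
    rows = []
    for w in range(weeks):
        row = []
        for block in feature_blocks:
            row.extend(block[w])   # append this block's features for week w
        rows.append(row)
    return rows
```

With M years of history, `weeks = 52 * M`, giving the transverse (52*M) direction, and the row width is the sum of the block dimensions.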
In this embodiment, when the data processing system processes data, the amount of data may vary, different amounts of data may call for different processing, and the acquired sample set is reduced or enlarged accordingly. Step S20 may therefore also be implemented as follows:
setting the number of the sample groups according to the size of the data sample set;
setting the number of sub-data items in each sample group according to the different requirements of the algorithm scenario on the input data, wherein the requirements include time constraints;
extracting sub-data from the data sample set according to the number of sample groups, the number of sub-data items per group and the small-batch extraction algorithm, to form the model training sample groups and the model verification sample groups respectively; the small-batch extraction algorithm includes simple random sampling, stratified random sampling and cluster random sampling.
In this embodiment, the setting of the sample-group parameters and the extraction of the sample groups may be implemented based on a batch-size algorithm. In deep learning, SGD training is generally adopted, that is, the training set is processed batch-size samples at a time. This is realized through mini-batches: the mini-batch idea borrows from both the batch gradient and stochastic gradient schemes to obtain an algorithm with higher running efficiency, avoiding to some extent the drawback of violent and unstable oscillation.
In this embodiment, in order to improve training accuracy during rule verification, two model training methods are generally selected to train models simultaneously. On this basis, step S30 includes:
selecting a first rule training algorithm and a second rule training algorithm according to the algorithm scene;
and according to the first rule training algorithm and the second rule training algorithm, respectively taking the sub-data in the model training sample groups as input to the algorithms, and training rule models to obtain N first rule extension models and M second rule extension models, where the values of N and M equal the number of sub-data items in the model training sample set.
In practical applications, the commonly used models are the XGBoost and LightGBM models. A random extraction algorithm draws 500 samples from the big data into one sample group (stratified or cluster samples may be drawn instead, depending on the scenario), and more than 10 sample groups are randomly extracted. For example, XGBoost and LightGBM are used for fast model training, finally yielding two sets of 10 models with different parameters.
Further, when rule model training is performed, the sample groups may be processed in combination with other methods to improve training efficiency. Optionally, training the rule models according to the first model training algorithm and the second model training algorithm, with data in the model training sample groups as input to the algorithms respectively, includes:
randomly dividing the model training sample group into K packets by adopting a cross validation method, wherein K is a positive integer larger than M;
randomly selecting one of the K packets as the test set, with the remaining K-1 packets as the training set;
and performing model training with the first model training algorithm and the second model training algorithm respectively on the K-1 training packets, and verifying with the test set, to obtain M first rule extension models and M second rule extension models.
In practical application, taking 5-fold cross validation as an example, the training set is randomly divided into 5 packets, i.e. sub-training sets 1, 2, 3, 4 and 5, each of which in turn serves as the test set while the remaining sub-training sets are used for training. Taking sub-training set 1 as the test set, a model is trained on sub-training sets 2, 3, 4 and 5 and checked against the labels of test set 1 to obtain model 1; the other four models are trained in the same way.
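The 5-fold split in the worked example can be sketched as follows. This is a minimal illustration with dummy integer samples; the actual training call for each fold is only indicated by a comment.

```python
# Minimal 5-fold cross-validation split matching the worked example above:
# the sample group is shuffled into K packets; each packet in turn serves as
# the test set while the remaining K-1 packets form the training set.
import random

def k_fold_packets(samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal packets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

samples = list(range(500))          # one 500-sample training group
packets = k_fold_packets(samples, k=5)

for fold, test_idx in enumerate(packets, start=1):
    train_idx = [i for j, p in enumerate(packets) if j != fold - 1 for i in p]
    # here model `fold` would be trained on train_idx and scored on test_idx
    print(f"model {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Each of the 5 iterations yields one model, giving the five models of the example.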
In this embodiment, once the two rule training algorithms have produced the rule extension models, the rule extensibility of the models is evaluated and verified against the model verification sample groups to obtain a verification result, and the models are ranked by that result. This may be implemented through the following steps:
inputting the model verification sample groups as input information into the first rule extension models and the second rule extension models in one-to-one correspondence, and outputting prediction results for the sub-data of each model verification sample group;
scoring the prediction results with a preset scoring model, and ranking the first rule extension models and the second rule extension models in descending order of score, to obtain a scoring matrix of the models;
summing the scoring matrix along its rows or columns to obtain the final scores of the first rule extension models and the second rule extension models respectively;
At this point, selecting, according to the ranking result and a preset greedy algorithm, the rule extension models that meet the extension condition as the final rule verification models includes:
selecting the n top-ranked models as the final rule verification models according to the final scores of the first rule extension models and the second rule extension models, where n is greater than or equal to 1.
In this embodiment, the above steps can be understood as a deduction-and-verification process for the verification rules of the big-data verification model: on the basis of the trained models, the rule extensibility of the big data embodied in the models is scored and verified, as described in the following example.
100 samples are drawn from the data sample set into one sample group by a random extraction algorithm, and more than 100 such sample groups are randomly extracted as verification sample groups. They are verified using the models XG1, XG2, …, XG10 and LGBM1, LGBM2, …, LGBM10 trained on the model training sample groups. For each sample group Yi, the 20 models are ranked by error rate and scored accordingly (the scoring scheme may be switched per scenario; for example, in sample group Y1 the ranking might be LGBM1, LGBM2, …, XGB10, …, XGB1, with the 20 models scored 20 down to 1). A scoring matrix is thus obtained, and the matrix is summed across sample groups to obtain each model's final score (this too may be varied per scenario, e.g. averaging after removing extreme values), completing the analysis from modeling to final verification.
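The scoring matrix of this example can be sketched directly. This is an illustration only: the error rates are random stand-ins rather than real model outputs, and the names XG1–XG10 / LGBM1–LGBM10 merely label the 20 hypothetical models.

```python
# Sketch of the scoring matrix described above: each verification group Yi
# ranks the 20 models by error rate (best scores 20, worst scores 1), and
# each model's scores are then summed across all groups. Errors are random
# stand-ins for real per-group error rates.
import random

MODELS = [f"XG{i}" for i in range(1, 11)] + [f"LGBM{i}" for i in range(1, 11)]
rng = random.Random(42)

def score_row(error_rates):
    """Rank models by ascending error rate: best scores 20, worst scores 1."""
    order = sorted(error_rates, key=error_rates.get)
    return {m: len(order) - r for r, m in enumerate(order)}

n_groups = 100   # "more than 100 sample groups" in the text
matrix = [score_row({m: rng.random() for m in MODELS}) for _ in range(n_groups)]

# Sum each model's scores across all verification groups ("summing the matrix").
final = {m: sum(row[m] for row in matrix) for m in MODELS}
best = max(final, key=final.get)
print("best model:", best, "score:", final[best])
```

The variant mentioned in the text (averaging after removing extreme values) would replace the plain sum with a trimmed mean per model.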
Of course, this is only a specific example; the deduction mechanism also applies to scenarios such as big-data analysis. The overall algorithm mechanism and modeling are similar; the difference is that one case builds a model and verifies the results based on the trained model, while the other analyzes and refines rules, then adds and verifies them according to the summarized rules.
In the above deduction process we applied greedy selection, which means that the overall optimal solution of the problem can be reached through a series of locally optimal choices. This is the first basic element of the greedy algorithm and the main difference between it and dynamic programming. Greedy selection makes successive top-down choices iteratively, and each greedy choice reduces the problem to a smaller sub-problem of the same kind. For a particular problem, to determine whether it has the greedy-choice property, one must prove that the greedy choice made at each step ultimately yields an optimal solution. Usually one first shows that some overall optimal solution of the problem begins with the greedy choice, and that after the choice the original problem reduces to a similar sub-problem of smaller scale; mathematical induction then shows that making the greedy choice at every step yields an overall optimal solution.
In practice we are not limited to LightGBM and XGBoost; in a typical industrial scenario dozens of models may be selected. During verification, to speed things up, only the top-ranked model may be kept each time, or this may be relaxed to recording the top 4 models in the ranking each time; finally all sample groups are verified and the optimal model is obtained.
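The greedy pruning just described can be sketched as follows. This is a hedged illustration: the 20 candidate model names and the random per-group errors are invented, and keeping the top 4 each round follows the "optimal 4 models" variant in the text.

```python
# Hedged sketch of the greedy deduction: after each verification group is
# scored, only the top-k models are kept and carried into the next group,
# so the candidate pool shrinks through a series of locally optimal choices.
# Errors are random stand-ins for real per-group error rates.
import random

rng = random.Random(7)
candidates = [f"model_{i}" for i in range(20)]   # hypothetical model pool
TOP_K = 4

for group in range(10):                       # 10 verification sample groups
    errors = {m: rng.random() for m in candidates}
    ranked = sorted(candidates, key=errors.get)   # best (lowest error) first
    candidates = ranked[:TOP_K]               # greedy choice: keep only top k

print("surviving models:", candidates)        # the 4 greedy survivors
```

Setting `TOP_K = 1` gives the faster variant that keeps only the single best model per round.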
In summary, the rule verification method provided in the embodiment of the present invention is a rolling rule verification mechanism combining the batch-size algorithm with a greedy algorithm. From a massive verification set, subsets of the full data are repeatedly and randomly extracted by the batch-size algorithm for verification; by sampling small data sets at random, a large data set is continually converted into small data sets for verification, reducing the complexity of verification and improving its efficiency.
The batch-size idea is also applied in rule extraction. Secondly, a greedy algorithm is adopted: deduction is performed with the greedy algorithm within the screened rule subclasses, and the optimal rule serial number under each path is recorded. Finally, all samples are traversed to obtain a data set of rule serial numbers, and the final rule rating is obtained by stratifying according to the ratio of each rule's occurrences to the n batches drawn from the full data: the larger that ratio, the stronger the rule's ductility. The rules are thereby optimized and the accuracy of the final data verification improved.
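The final rating step reduces to a simple ratio per rule. The sketch below assumes that a per-rule count of "batches whose greedy path kept this rule" is already available; the rule names and counts are made up for illustration.

```python
# Hedged sketch of the ductility rating: for each rule, the number of batches
# whose greedy path retained it, divided by the n batches drawn from the full
# data. The higher the ratio, the stronger the rule's ductility. Counts here
# are hypothetical.
def ductility(rule_hits, n_batches):
    """rule_hits: {rule_id: number of batches whose greedy path kept it}."""
    return {r: hits / n_batches for r, hits in rule_hits.items()}

ratings = ductility({"rule_3": 87, "rule_12": 41, "rule_7": 9}, n_batches=100)
for rule, ratio in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(rule, ratio)
```

Stratifying these ratios into bands (e.g. strong / moderate / weak ductility) then yields the final rule rating described above.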
As shown in fig. 2, a flowchart of a specific implementation of the rule training method is provided in the embodiment of the present invention. This embodiment describes the method in detail with reference to a specific implementation scenario; the method specifically includes the following steps:
step S210, obtaining mass data to be verified to form a verification set of the data;
in this step, the acquired mass data refers to input data for various algorithm scenarios. For example, when predicting the weekly influenza ILI index: through cooperation with governments and disease control centers, influenza ILI index data for the past M years are obtained and X0 dimensions are derived by some simple derivative algorithms; by purchasing data from the meteorological bureau, weather-related data for the past M years or more are obtained, including portrait information on weather dimensions such as day and night temperature, air pressure, precipitation, air quality index and humidity, giving X1 dimensions; meanwhile, a crawler collects indexes from various popular websites, mainly public-opinion portrait dimensions such as sneezing, nasal congestion, headache and fever, giving X2 dimensions. In each scenario, such big data can be collected through cooperation or other technical means. In this scenario, our need is a prediction of the future weekly influenza ILI index, and the portrait dimensions refer to the ILI historical portrait dimension, the weather data portrait dimension, the public-opinion data portrait dimension and so on for each week of the past M years. We therefore finally obtain a (52 × M) × γ(X0 + X1 + X2 + …) portrait, where the horizontal direction is (52 × M), i.e. (52 × M) weekly cycles, the vertical direction is γ × (X0 + X1 + X2 + …), and γ is the dimensionality factor derived from some feature engineering algorithms.
Step S220, repeatedly and randomly extracting a data subset from the verification set through a batch size algorithm;
specifically, the number of samples drawn each time is limited according to the size of the whole verification set, and is generally controlled at 500-1000 samples per verification group. Depending on requirements and time constraints, 10-100 groups are selected for verification.
In the extraction process, random sampling survey methods are adopted: simple random sampling, stratified random sampling and cluster random sampling are selected according to the requirements of different algorithm scenarios. For example, when the label data contain an explicit feature hierarchy, the sampling probability of each stratum is kept the same during extraction; cluster sampling can be handled in the same way.
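The stratified case just mentioned can be sketched as follows. This is an illustration only: the two strata ("flu"/"cold"), their sizes, and the 10% sampling rate are invented to show equal per-stratum probability.

```python
# Illustrative stratified random sampling: when the label data have explicit
# strata, each stratum is sampled at the same rate, so its share in the
# sample matches its share in the population. Strata and rate are made up.
import random

def stratified_sample(items, get_stratum, rate, seed=0):
    rng = random.Random(seed)
    by_stratum = {}
    for it in items:
        by_stratum.setdefault(get_stratum(it), []).append(it)
    sample = []
    for members in by_stratum.values():
        k = max(1, round(len(members) * rate))   # same probability per stratum
        sample.extend(rng.sample(members, k))
    return sample

population = [("flu", i) for i in range(800)] + [("cold", i) for i in range(200)]
s = stratified_sample(population, get_stratum=lambda x: x[0], rate=0.1)
print(len(s))   # 100 samples: 80 from "flu", 20 from "cold"
```

Cluster sampling follows the same pattern, except that whole clusters are drawn instead of individual members within each group.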
Step S230, performing deduction analysis of the data verification rule based on all the extracted data subsets;
this step mainly realizes simple rough deduction, model building and result verification for the analysis-and-modeling mechanism under massive data. Take the following modeling mechanism as an example:
firstly, 500 samples are randomly extracted to form a sample group, specifically, layered or grouped sampling can be performed according to a scene change sampling method, and more than 10 sample groups are randomly extracted. For example, XGBOSt and LightGBM are used for fast model training, and two models with 10 different parameters are finally obtained.
Secondly, the subsequent sampling continues: 100 samples are extracted as one sample group, and more than 100 sample groups are randomly extracted as verification sample groups. Verification is performed with the previous XG1, XG2, …, XG10 and LGBM1, LGBM2, …, LGBM10. For each sample group Yi, the 20 models are ranked by error rate and scored accordingly (the scheme may be switched per scenario; for example, in sample group Y1 the ranking might be LGBM1, LGBM2, …, XGB10, …, XGB1, with the 20 models scored 20 down to 1). A scoring matrix is thus obtained, and the matrix is summed across sample groups to obtain each model's final score (this too may be varied per scenario, e.g. averaging after removing extreme values), completing the analysis from modeling to final verification.
Of course, this is only a specific example; the deduction mechanism can also be applied to scenarios such as big-data analysis. The overall algorithm mechanism and modeling are similar; the difference is that one case builds a model and verifies the results based on the trained model, while the other analyzes and refines rules, then adds and verifies them according to the summarized rules.
In the above deduction process, greedy selection was applied: the overall optimal solution of the problem can be reached through a series of locally optimal choices. This is the first basic element of the greedy algorithm and the main difference between it and dynamic programming. Greedy selection makes successive top-down choices iteratively, and each greedy choice reduces the problem to a smaller sub-problem of the same kind. For a particular problem, to determine whether it has the greedy-choice property, it must be demonstrated that the greedy choice made at each step ultimately yields an optimal solution. Usually one first shows that some overall optimal solution of the problem begins with the greedy choice, and that after the choice the original problem reduces to a similar sub-problem of smaller scale; mathematical induction then shows that making the greedy choice at every step yields an overall optimal solution.
In practice not only LightGBM and XGBoost are selected; dozens of models may be chosen in a typical industrial scenario. During verification, to speed things up, only the top-ranked model may be selected each time, or this may be extended to recording the top 4 models in the ranking each time; finally all sample groups are verified and the optimal model is obtained.
Step S240, carrying out continuity rating on the data verification rule according to the result of deduction analysis;
specifically, after 100 rules are refined, or one rule is refined and divided into 100 gradations γ1 to γ100, a scoring matrix is finally obtained over the 100 sampling groups. Scoring may use an arithmetic progression or a classification rating. For example, an error in the range 0.01-0.05 scores 10 points, and an error in the range 0.05-0.10 scores 2 points.
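The classification-rating option can be sketched as a banded scoring function. Only the two middle bands follow the example in the text; the top and bottom bands are hypothetical extensions added to make the function total.

```python
# Sketch of the classification rating described above: map a rule's
# verification error into a score band. The 0.01-0.05 -> 10 and
# 0.05-0.10 -> 2 bands follow the text; the other two are hypothetical.
def band_score(error):
    if error < 0.01:
        return 20          # hypothetical top band
    if error < 0.05:
        return 10          # band given in the text
    if error < 0.10:
        return 2           # band given in the text
    return 0               # hypothetical: errors beyond 0.10 score nothing

print([band_score(e) for e in (0.005, 0.03, 0.07, 0.2)])   # [20, 10, 2, 0]
```

An arithmetic-progression scoring, by contrast, would assign 20 down to 1 directly from the rank, as in the earlier scoring-matrix example.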
Step S250, selecting the final model as the rule verification model for data analysis according to the continuity rating.
In this embodiment, through the above scheme, a rolling rule verification mechanism combining the batch-size algorithm with a greedy algorithm allows repeated random sampling of the data set and the rules, quickly decomposing the massive data set. Secondly, based on the greedy algorithm, rule verification and search are carried out simply and quickly through a local-optimum search mechanism. Finally, a search path based on the greedy algorithm determines the rule continuity rating according to the data-volume proportion, quickly and effectively, and stabilizes the verification accuracy above 80%.
In order to solve the above problem, the present invention further provides a rule training device, which may be configured to implement the rule training method provided in the embodiment of the present invention. The device is physically implemented as a server, whose specific hardware implementation is shown in fig. 3.
Referring to fig. 3, the server includes: a processor 301, e.g. a CPU, a communication bus 302, a user interface 303, a network interface 304, and a memory 305. The communication bus 302 enables communication among these components. The user interface 303 may comprise a display and an input unit such as a keyboard, and the network interface 304 may optionally comprise a standard wired interface or a wireless interface (e.g. a WI-FI interface). The memory 305 may be a high-speed RAM memory or a non-volatile memory such as a disk memory, and may alternatively be a storage device separate from the processor 301.
Those skilled in the art will appreciate that the hardware configuration of the apparatus shown in FIG. 3 does not constitute a limitation of the rule training device, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 3, the memory 305, as a computer-readable storage medium, may contain an operating system, a network communication module, a user interface module, and a rule training program. The operating system is a program that manages the device's hardware and software resources and supports the operation of the rule training program and other software and/or programs.
In the hardware configuration of the server shown in fig. 3, the network interface 304 is mainly used for accessing a network; the user interface 303 is mainly used for obtaining data from a data-generating terminal connected to the external internet to form a data sample set and transmitting it to the processor 301; and the processor 301 may be used to call the rule training program stored in the memory 305 and execute the operations of the various embodiments of the rule training method.
In this embodiment of the present disclosure, the device of fig. 3 may also be a touch-operated mobile terminal such as a mobile phone, whose processor trains the rules of the big data by reading program code, stored in a buffer or storage unit, that implements the rule training method, thereby obtaining a model with more accurate extensibility.
In order to solve the above problem, an embodiment of the present invention further provides a rule training apparatus, and referring to fig. 4, fig. 4 is a schematic diagram of functional modules of the rule training apparatus provided in the embodiment of the present invention. In this embodiment, the apparatus comprises:
the acquisition module 41 is configured to determine a current algorithm scenario, and obtain data in a current data system based on the algorithm scenario to obtain a data sample set;
an extraction module 42, configured to repeatedly and randomly extract sub-data from the data sample set according to a preset small-batch extraction algorithm, and generate sample groups based on the sub-data, where the sample groups include at least one model training sample group and at least one model verification sample group;
the training module 43 is configured to select a rule training algorithm corresponding to the current algorithm scenario according to a correspondence between the algorithm scenario and the rule training algorithm, and perform rule model training according to the rule training algorithm and the model training sample set to obtain a rule extension model;
the verification module 44 is configured to perform rule extensibility evaluation verification on the rule extension model according to the model verification sample group to obtain a verification result, and sort the rule extension model according to the verification result;
and the determining module 45 is configured to select, according to the ranking result and a preset greedy algorithm, the rule extension models meeting the extension condition as the final rule verification models, where the rule verification models are used to analyze data in the data system and the greedy algorithm is used for deduction-based verification of the rule extension models.
Since the rule training device is based on the same inventive concept as the rule training method in the above embodiments, its embodiment is not described in detail here.
With this device, a rolling rule verification mechanism combining the small-batch extraction algorithm with a greedy algorithm allows repeated random sampling of the data set and the rules, quickly decomposing the massive data set. Secondly, based on the greedy algorithm, rule verification and search are carried out simply and quickly through a local-optimum search mechanism, yielding a rule verification model with better continuity and ductility, and data analysis is performed based on the rules obtained from the model.
The invention also provides a computer readable storage medium.
In this embodiment, the computer readable storage medium has a rule training program stored thereon, and the rule training program, when executed by a processor, implements the steps of the rule training method as described in any one of the above embodiments. The method implemented by the rule training program when executed by the processor may refer to various embodiments of the rule training method of the present invention, and therefore, the details are not repeated.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, or by hardware, though in many cases the former is the better implementation. On this understanding, the technical solutions of the present invention may be embodied as a software product stored in a storage medium (e.g. ROM/RAM) and including instructions for causing a terminal (e.g. a mobile phone, computer, server or network device) to execute the methods of the embodiments of the present invention.
The present invention is described with reference to the accompanying drawings, but it is not limited to the above embodiments, which are illustrative rather than restrictive. Those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification, drawings and claims are intended to be embraced therein.

Claims (10)

1. A method of rule training, the method comprising the steps of:
determining a current algorithm scene, and acquiring data in a current data system based on the algorithm scene to obtain a data sample set;
repeatedly randomly extracting sub-data from the data sample set according to a preset small-batch extraction algorithm, and generating a sample group based on the sub-data, wherein the sample group comprises at least one model training sample group and at least one model verification sample group;
selecting a rule training algorithm corresponding to an algorithm scene according to a corresponding relation between a preset algorithm scene and the rule training algorithm, and performing rule model training according to the rule training algorithm and the model training sample set so as to extract a change rule of sample data from the model training sample set and generate a corresponding rule extension model;
according to the model verification sample group, performing rule ductility evaluation verification on each rule extension model to obtain a verification result, and sequencing each rule extension model according to the verification result;
and selecting a rule extension model meeting extension conditions as a final rule verification model according to the sequencing result and a preset greedy algorithm, wherein the rule verification model is used for analyzing data in the data system, and the greedy algorithm is used for performing verification deduction on the rule extension model.
2. The rule training method of claim 1, wherein the determining a current algorithm scenario and obtaining data in a current data system based on the algorithm scenario, and obtaining a data sample set comprises:
acquiring various algorithms under a current scene, and determining the input data type of the algorithms based on the algorithms;
based on the data type, selecting a data source meeting the data type;
and reading a data set corresponding to the data type according to the data source, and extracting a small data set from the data set in a cyclic random extraction mode to form the data sample set, wherein the small data set at least comprises two kinds of data with different attributes.
3. The rule training method of claim 2, wherein after said extracting a small data set from said data set by cyclic random extraction to form said data sample set, further comprises:
if the data sample set is a multi-dimensional portrait data set, analyzing related information of each data in a small data set extracted from the data set, wherein the related information comprises attributes of the data;
and establishing a multi-dimensional data portrait by taking the attribute of the data as a coordinate label, and taking the data portrait as the data sample set.
4. The rule training method of claim 3, wherein the repeating of randomly extracting sub-data from the set of data samples according to a pre-defined small batch extraction algorithm and generating a sample set based on the sub-data comprises:
setting the number of the sample groups according to the size of the data sample set;
setting the number of subdata in the sample group according to different requirements of the algorithm scene on input data, wherein the requirements comprise time length;
and extracting sub-data from the data sample set according to the number of the sample groups, the number of sub-data items in the sample groups and the small-batch extraction algorithm, to respectively form the model training sample group and the model verification sample group, wherein the small-batch extraction algorithm comprises simple random sampling, stratified random sampling and cluster random sampling.
5. The rule training method according to claim 4, wherein the selecting a rule training algorithm corresponding to the algorithm scenario according to a preset correspondence between the algorithm scenario and the rule training algorithm, and performing rule model training according to the rule training algorithm and the model training sample set, so as to extract variation rules of sample data from the model training sample set and generate a corresponding rule extension model comprises:
selecting a first rule training algorithm and a second rule training algorithm according to the algorithm scene;
and according to the first rule training algorithm and the second rule training algorithm, respectively taking the sub-data in the model training sample set as input to the algorithms, and training rule models to obtain N first rule extension models and M second rule extension models, wherein the values of N and M are less than or equal to the total number of sub-data items in the model training sample set.
6. The rule training method of claim 5, wherein the training of the rule model according to the first rule training algorithm and the second rule training algorithm with the subdata in the model training sample set as the input of the algorithm respectively comprises:
randomly dividing the model training sample group into K packets by adopting a cross validation method, wherein K is a positive integer larger than M;
randomly selecting one of the K packets as a test set, and using the rest K-1 packets as a training set;
and respectively carrying out model training by adopting the first model training algorithm and the second model training algorithm according to the K-1 training sets, and verifying by using a test set to obtain M first rule extension models and M second rule extension models.
7. The method of claim 6, wherein the performing rule extensibility evaluation verification on each of the rule extension models according to the model verification sample group to obtain a verification result, and ranking each of the rule extension models according to the verification result, comprises:
inputting the model verification sample group as input information into the first rule extension model and the second rule extension model in a one-to-one correspondence manner, and outputting a prediction result of each subdata of the model verification sample group;
respectively scoring the prediction results through a preset scoring model, and respectively sequencing the first rule extension model and the second rule extension model according to a descending order based on the scores to obtain a scoring matrix of the models;
performing summation calculation on transverse scores or longitudinal scores of the scoring matrix to obtain final scoring results of the first rule extension model and the second rule extension model respectively;
the step of selecting the rule extension model meeting the extension condition as the final rule verification model according to the sequencing result and the greedy algorithm comprises the following steps:
and selecting n ranked top rules as final rule verification models according to the final scoring results of the first rule extension model and the second rule extension model, wherein n is greater than or equal to 1.
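The scoring-matrix summation and top-n greedy selection of claim 7 can be sketched as below. This is an illustrative reading of the claim, not the patent's implementation; `final_scores` and `select_top_n` are made-up names.

```python
def final_scores(score_matrix, by_row=True):
    # Sum the row-wise (or column-wise) scores of the scoring matrix
    # to obtain one final score per rule extension model.
    if by_row:
        return [sum(row) for row in score_matrix]
    return [sum(col) for col in zip(*score_matrix)]

def select_top_n(models, scores, n):
    # Greedy selection: rank models by final score in descending order
    # and keep the top n (n >= 1) as the final rule verification models.
    ranked = sorted(zip(scores, models), key=lambda pair: pair[0], reverse=True)
    return [model for _, model in ranked[:n]]
```

For example, with scoring-matrix rows `[[3, 1], [2, 5], [0, 0]]`, the row sums are `[4, 7, 0]`, so the second model ranks first.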
8. A rule training device, characterized in that the rule training device comprises:
an acquisition module, configured to determine a current algorithm scenario, and acquire data from a current data system based on the algorithm scenario to obtain a data sample set;
an extraction module, configured to repeatedly and randomly extract sub-data from the data sample set according to a preset mini-batch extraction algorithm, and generate sample groups based on the sub-data, wherein the sample groups comprise at least one model training sample set and at least one model verification sample set;
a training module, configured to select a rule training algorithm corresponding to the algorithm scenario according to a preset correspondence between algorithm scenarios and rule training algorithms, and perform rule model training according to the rule training algorithm and the model training sample set, so as to extract a change rule of the sample data from the model training sample set and generate a corresponding rule extension model;
a verification module, configured to perform rule extensibility evaluation and verification on each rule extension model according to the model verification sample set to obtain a verification result, and sort the rule extension models according to the verification result;
and a determining module, configured to select, according to the sorting result and a preset greedy algorithm, a rule extension model meeting the extension condition as the final rule verification model, wherein the rule verification model is used for analyzing data in the data system, and the greedy algorithm is used for deducing the verification of the rule extension models.
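The extraction module's repeated mini-batch sampling can be sketched as follows. This is a minimal sketch under the assumption that each sample group is an independent random draw from the data sample set; the function and parameter names (`mini_batch_sample_groups`, `batch_size`, `n_train`, `n_valid`) are illustrative, not from the patent.

```python
import random

def mini_batch_sample_groups(data, batch_size, n_train, n_valid, seed=None):
    # Repeatedly draw random mini-batches of sub-data from the data
    # sample set, grouping them into model training sample sets and
    # model verification sample sets (the extraction module's role).
    rng = random.Random(seed)
    draw = lambda: rng.sample(data, batch_size)
    return {
        "train_groups": [draw() for _ in range(n_train)],
        "valid_groups": [draw() for _ in range(n_valid)],
    }
```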
9. A rule training device, characterized in that the rule training device comprises: a memory, a processor, and a rule training program stored on the memory and executable on the processor, wherein the rule training program, when executed by the processor, implements the steps of the rule training method of any one of claims 1-7.
10. A computer-readable storage medium having a rule training program stored thereon, wherein the rule training program, when executed by a processor, implements the steps of the rule training method of any one of claims 1-7.
CN201910705620.8A 2019-08-01 2019-08-01 Rule training method, device, equipment and storage medium Pending CN110647995A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910705620.8A CN110647995A (en) 2019-08-01 2019-08-01 Rule training method, device, equipment and storage medium
PCT/CN2019/117837 WO2021017293A1 (en) 2019-08-01 2019-11-13 Rule training method, apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910705620.8A CN110647995A (en) 2019-08-01 2019-08-01 Rule training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110647995A true CN110647995A (en) 2020-01-03

Family

ID=68989855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910705620.8A Pending CN110647995A (en) 2019-08-01 2019-08-01 Rule training method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110647995A (en)
WO (1) WO2021017293A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8774515B2 (en) * 2011-04-20 2014-07-08 Xerox Corporation Learning structured prediction models for interactive image labeling
US9147129B2 (en) * 2011-11-18 2015-09-29 Honeywell International Inc. Score fusion and training data recycling for video classification
CN104463324A (en) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolution neural network parallel processing method based on large-scale high-performance cluster
CN104462458A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Data mining method of big data system
CN109446251A (en) * 2018-09-04 2019-03-08 北京睿企信息科技有限公司 System and method for distributed artificial intelligence application development

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529033A (en) * 2020-09-22 2021-03-19 陕西土豆数据科技有限公司 Method for solving data imbalance of remote sensing image multi-classification scene segmentation algorithm
CN112214402A (en) * 2020-09-23 2021-01-12 深圳大学 Code verification algorithm selection method and device and storage medium
CN112214402B (en) * 2020-09-23 2023-07-18 深圳大学 Code verification algorithm selection method, device and storage medium
CN112256584A (en) * 2020-10-30 2021-01-22 深圳无域科技技术有限公司 Internet number making method and system
CN112256584B (en) * 2020-10-30 2021-08-31 深圳无域科技技术有限公司 Internet number making method and system
CN112989606A (en) * 2021-03-16 2021-06-18 上海哥瑞利软件股份有限公司 Data algorithm model checking method, system and computer storage medium
CN112989606B (en) * 2021-03-16 2023-06-16 上海哥瑞利软件股份有限公司 Data algorithm model checking method, system and computer storage medium
CN114694420A (en) * 2022-03-23 2022-07-01 合肥置顶信息技术有限公司 System and method for making and publishing civil aviation weather forecast capable of intelligently correcting errors
CN114694420B (en) * 2022-03-23 2024-01-12 合肥置顶信息技术有限公司 Civil aviation weather forecast making and publishing system and method capable of intelligently correcting errors
WO2023186091A1 (en) * 2022-04-02 2023-10-05 维沃移动通信有限公司 Sample determination method, apparatus and device

Also Published As

Publication number Publication date
WO2021017293A1 (en) 2021-02-04

Similar Documents

Publication Publication Date Title
CN110647995A (en) Rule training method, device, equipment and storage medium
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
CN106251174A (en) Information recommendation method and device
CN108364106A (en) A kind of expense report Risk Forecast Method, device, terminal device and storage medium
US11841839B1 (en) Preprocessing and imputing method for structural data
CN109299258A (en) A kind of public sentiment event detecting method, device and equipment
CN109635010B (en) User characteristic and characteristic factor extraction and query method and system
CN112328909A (en) Information recommendation method and device, computer equipment and medium
WO2024067387A1 (en) User portrait generation method based on characteristic variable scoring, device, vehicle, and storage medium
CN112329816A (en) Data classification method and device, electronic equipment and readable storage medium
CN108536815B (en) Text classification method and device
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN114612251A (en) Risk assessment method, device, equipment and storage medium
CN113516417A (en) Service evaluation method and device based on intelligent modeling, electronic equipment and medium
CN110472659B (en) Data processing method, device, computer readable storage medium and computer equipment
Swarndeep Saket et al. Implementation of extended K-Medoids algorithms to increase efficiency and scalability using large dataset
CN106484913A (en) Method and server that a kind of Target Photo determines
CN113761193A (en) Log classification method and device, computer equipment and storage medium
CN111259117A (en) Short text batch matching method and device
CN113516200B (en) Model training scheme generation method and device, electronic equipment and storage medium
CN116049644A (en) Feature screening and clustering and binning method and device, electronic equipment and storage medium
CN115409541A (en) Cigarette brand data processing method based on data blood relationship
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN115617790A (en) Data warehouse creation method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200103