CN112819034A - Data binning threshold calculation method and device, computer equipment and storage medium - Google Patents


Info

Publication number: CN112819034A
Application number: CN202110036327.4A
Authority: CN (China)
Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 牛犇, 张莉, 陈弘, 吴志成
Current and original assignee: Ping An Technology Shenzhen Co Ltd

Classifications

    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23 — Pattern recognition: clustering techniques
    • G06N 3/044 — Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Neural networks: combinations of networks
    • G06N 3/049 — Neural networks: temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06Q 10/06393 — Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G06Q 40/08 — Insurance


Abstract

The invention relates to the technical field of artificial intelligence and provides a data binning threshold calculation method and device, computer equipment, and a storage medium. The method comprises the following steps: standardizing a plurality of first user data according to their data distribution under each first data field; generating a standard data vector for each user from the standard data obtained by standardization, and performing feature extraction to obtain a plurality of feature data; training an integrated prediction model by taking the plurality of feature data as training data and a plurality of second user data under corresponding second data fields as training targets; acquiring a plurality of target feature data from a specified layer of the integrated prediction model and clustering them to obtain a plurality of target feature data clusters; and determining a second user data cluster corresponding to each target feature data cluster, then calculating a plurality of data binning thresholds according to the intersections of the second user data clusters. The invention can quickly calculate data binning thresholds and update them in real time.

Description

Data binning threshold calculation method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data binning threshold calculation method and device, computer equipment and a storage medium.
Background
Insurance companies typically perform performance assessment on agents based on their performance compared to a plurality of data binning thresholds to determine whether to award or raise the level of the agent.
In the process of implementing the invention, the inventor found that most existing data binning thresholds are calculated statistically by team leaders. Faced with thousands of performance records, this statistical approach is computationally heavy; moreover, as performance data accumulates, the original data binning thresholds may no longer suit the new data, so the thresholds must be recomputed statistically, which makes the calculation inefficient. In addition, the statistical approach derives the binning thresholds from the performance data alone, ignoring the internal logic between a user's first user data and his or her performance data, so the bins determined this way lack a sound basis and have low reliability.
Disclosure of Invention
In view of the foregoing, there is a need for a method, an apparatus, a computer device and a storage medium for calculating a data binning threshold, which can calculate the data binning threshold quickly and update the data binning threshold in real time.
A first aspect of the present invention provides a data binning threshold calculation method, including:
determining data distribution of a plurality of first user data under each first data field, and performing standardization processing on the plurality of first user data under the corresponding first data field according to the data distribution;
generating a standard data vector of each user according to the standard data obtained by standardization, and performing feature extraction on the standard data vector of each user to obtain a plurality of feature data;
taking the plurality of characteristic data as training data, taking a plurality of second user data under corresponding second data fields as training targets, and training an integrated prediction model;
acquiring a plurality of target characteristic data of a specified layer of the integrated prediction model, and clustering the plurality of target characteristic data to obtain a plurality of target characteristic data clusters;
and determining a second user data cluster corresponding to each target characteristic data cluster, and calculating a plurality of data binning thresholds according to the intersection of the second user data clusters.
In an optional embodiment, the determining a second user data cluster corresponding to each target feature data cluster, and calculating a plurality of data binning thresholds according to an intersection of the plurality of second user data clusters includes:
determining second user data corresponding to the target characteristic data in each target characteristic data cluster to obtain a corresponding second user data cluster;
calculating central data in each second user data cluster, and sorting the plurality of second user data clusters according to the central data;
combining every two adjacent second user data clusters in the sorted second user data clusters to obtain a second user data cluster pair;
and calculating a plurality of data binning thresholds according to a plurality of second user data cluster pairs.
In an alternative embodiment, said calculating a plurality of data binning thresholds from a plurality of said second user data cluster pairs comprises:
when a data intersection exists between two second user data clusters in a second user data cluster pair, determining a left data intersection point and a right data intersection point in the data intersection; calculating a data binning threshold for the second user data cluster pair from the left data intersection and the right data intersection;
and when the two second user data clusters in the second user data cluster pair do not have data intersection, calculating the data binning threshold of the second user data cluster pair according to the central data in the two second user data clusters.
In an optional embodiment, the normalizing the first user data under the corresponding first data field according to the data distribution includes:
when the data distribution of first user data under a first data field is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data field;
and when the data distribution of the first user data in the first data field is continuous data distribution, performing power operation on the first user data by using a preset power function to obtain a plurality of standard data in the first data field.
In an optional embodiment, the extracting the features of the standard data vector of each user to obtain a plurality of feature data includes:
extracting a plurality of cross features based on each standard data vector using an attention factorizer;
extracting a plurality of high-order features based on each standard data vector by using a multilayer perceptron;
generating a combined feature according to the plurality of cross features and the corresponding plurality of high-order features;
and carrying out normalization processing on the plurality of combined features to obtain a plurality of feature data.
In an optional embodiment, the training of the integrated predictive model with the plurality of feature data as training data and the plurality of second user data under the corresponding second data fields as training targets includes:
inputting the plurality of feature data into a pre-training model, and extracting a feature vector of each feature data through the pre-training model;
inputting a plurality of feature vectors into a Bi-LSTM model for training;
acquiring a plurality of prediction data output by the Bi-LSTM model;
calculating residuals between a plurality of second user data under the second data field and the plurality of prediction data;
and training the Bi-LSTM model based on the residual error by adopting a forward feedback mechanism to obtain an integrated prediction model.
In an optional embodiment, the method further comprises:
acquiring a plurality of target characteristic data of a user to be detected according to the first data field;
predicting by using the integrated prediction model based on the target feature data to obtain prediction data of the user to be tested under the second data field;
comparing the predicted data under the second data field of the user to be tested with the plurality of data binning thresholds;
and determining the grade of the user to be tested according to the comparison result obtained by the comparison.
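As a concrete illustration of this comparison step, the grade lookup reduces to a search over sorted thresholds. The threshold values, grade names, and helper function below are hypothetical, not taken from the patent:

```python
from bisect import bisect_right

# Hypothetical data binning thresholds on the predicted second user data
# (e.g. performance), sorted ascending; grade names are illustrative.
thresholds = [10_000.0, 50_000.0, 120_000.0]
grades = ["C", "B", "A", "S"]             # one more grade than thresholds

def grade_of(predicted: float) -> str:
    """Compare the prediction with each threshold; the bin it falls
    into determines the grade of the user to be tested."""
    return grades[bisect_right(thresholds, predicted)]

print(grade_of(8_000.0), grade_of(60_000.0), grade_of(200_000.0))   # C A S
```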
A second aspect of the present invention provides a data binning threshold calculation apparatus, the apparatus comprising:
the processing module is used for determining the data distribution of the plurality of first user data under each first data field and carrying out standardization processing on the plurality of first user data under the corresponding first data field according to the data distribution;
the extraction module is used for generating a standard data vector of each user according to the standard data obtained by the standardization processing, and performing feature extraction on the standard data vector of each user to obtain a plurality of feature data;
the training module is used for training the integrated prediction model by taking the plurality of characteristic data as training data and taking a plurality of second user data under corresponding second data fields as training targets;
the clustering module is used for acquiring a plurality of target characteristic data of a specified layer of the integrated prediction model and clustering the plurality of target characteristic data to obtain a plurality of target characteristic data clusters;
and the calculation module is used for determining a second user data cluster corresponding to each target characteristic data cluster and calculating a plurality of data binning thresholds according to the intersection of the second user data clusters.
A third aspect of the invention provides a computer apparatus comprising a processor for implementing the data binning threshold calculation method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data binning threshold calculation method.
In summary, in the data binning threshold calculation method, apparatus, computer device and storage medium according to the present invention, different normalization processes are applied depending on the data distribution of the first user data under different first data fields; a standard data vector is generated from the standard data obtained by standardization, and features are extracted from each user's standard data vector to obtain a plurality of feature data. The plurality of feature data are then used as training data, with a plurality of second user data under corresponding second data fields as training targets, to train the integrated prediction model, after which a plurality of target feature data are obtained from a specified layer of the integrated prediction model. Clustering the target feature data in turn clusters the plurality of second user data under the second data field, and finally a plurality of data binning thresholds are calculated from the intersections of the resulting second user data clusters. The method combines the first user data of the users and quickly determines optimal data binning thresholds from the plurality of second user data. Even when the second user data changes over time, the data binning thresholds can be updated quickly and in real time simply by iteratively updating the integrated prediction model, so the thresholds remain the optimal choice under objective conditions.
Drawings
Fig. 1 is a flowchart of a method for calculating a data binning threshold according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a data binning threshold calculation apparatus according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The data binning threshold calculation method provided by the embodiment of the invention is executed by computer equipment, and correspondingly, the data binning threshold calculation device runs in the computer equipment.
Fig. 1 is a flowchart of a method for calculating a data binning threshold according to an embodiment of the present invention. The data binning threshold calculation method specifically comprises the following steps, and the sequence of the steps in the flowchart can be changed and some steps can be omitted according to different requirements.
And S11, determining the data distribution of the plurality of first user data under each first data field, and normalizing the plurality of first user data under the corresponding first data field according to the data distribution.
Each user corresponds to a plurality of first data fields, which may include, but are not limited to: local economic data, basic data, business data, etc. The local economic data may include GDP, population size, and the like; the basic data may include educational background, years of service, and the like; and the business data may include average daily turnover, average daily recruitment, and the like. These first data fields are only examples and impose no limitation on the present invention; the first data fields, and hence the first user data, may be determined according to the actual application scenario.
The first user data may be extracted from a database internal to the enterprise.
The plurality of first user data under different data fields have different data distributions, and the first user data under the corresponding first data field are subjected to standardized processing according to the data distributions, so that the oriented processing of the first user data under different first data fields can be realized.
In an optional embodiment, the normalizing the first user data under the corresponding first data field according to the data distribution includes:
when the data distribution of first user data under a first data field is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data field;
and when the data distribution of the first user data in the first data field is continuous data distribution, performing power operation on the first user data by using a preset power function to obtain a plurality of standard data in the first data field.
Some first user data under a first data field may be discrete (e.g., educational background, age), while others may be continuous (e.g., average daily turnover, average daily recruitment), so the first user data need to be standardized according to their data distribution.
In specific implementation, the computer device first obtains first user data of a plurality of users belonging to the same first data field, then determines data distribution of the plurality of first user data in each first data field, and adopts different standardized processing modes for the plurality of first user data in the first data field according to the data distribution.
To address missing values in discrete data, when the data distribution of the first user data under a certain first data field is a discrete data distribution, a function is fitted to the existing first user data, and the fitting function is then used to estimate the missing data, which is filled into the corresponding position of the first user data to ensure its integrity. For example, for the age field, suppose the age data of 10 users are 21, 22, 23, 24, 26, 27, 28, 29, 30, indicating that the age of the 5th user is missing. The 9 points (1, 21), (2, 22), (3, 23), (4, 24), (6, 26), (7, 27), (8, 28), (9, 29), (10, 30) are fitted with a function, and substituting the argument 5 into the resulting fitting function yields the missing value 25.
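This fitting step can be sketched with NumPy; the polynomial degree and variable names below are illustrative assumptions, not specified by the patent:

```python
import numpy as np

# Observed (user index, age) pairs from the example; user 5 is missing.
xs = np.array([1, 2, 3, 4, 6, 7, 8, 9, 10], dtype=float)
ys = np.array([21, 22, 23, 24, 26, 27, 28, 29, 30], dtype=float)

# Fit a low-degree polynomial to the available points
# (this example data happens to be exactly linear).
fit = np.poly1d(np.polyfit(xs, ys, deg=1))

# Evaluate the fitting function at the missing index to fill the gap.
missing = fit(5)
print(round(float(missing), 2))   # 25.0
```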
When the data distribution of the first user data corresponding to a certain first data field is a continuous data distribution, a power function, for example, a log function, may be used to perform a power operation on each first user data in the first data field, so as to achieve the purpose of discretizing the continuous data.
In other embodiments, when the data distribution of the first user data corresponding to a certain first data field is continuous data distribution, the binning operation may be performed on the first user data in the first data field, so as to achieve the purpose of discretizing the continuous data. Wherein the binning operation may include chi-square binning, equidistant binning, equal frequency binning, and the like.
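A minimal sketch of both discretization options just described, a log transform followed by equal-frequency binning, using hypothetical turnover values (the field values and bin count are assumptions):

```python
import numpy as np

# Hypothetical continuous field: average daily turnover for 8 users.
turnover = np.array([120.0, 80.0, 1500.0, 300.0, 60.0, 950.0, 410.0, 220.0])

# Log transform compresses the heavy right tail (the "power operation").
log_turnover = np.log(turnover)

# Equal-frequency binning: quantile edges put roughly the same
# number of users in each of the 4 bins.
edges = np.quantile(log_turnover, [0.25, 0.5, 0.75])
bins = np.digitize(log_turnover, edges)   # bin index 0..3 per user

print(np.bincount(bins))   # [2 2 2 2]: two users per bin
```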
In this optional embodiment, the plurality of first user data under different first data fields have different data distributions, and different standardization processes are applied to the first user data under a given first data field according to the type of its distribution. This differentiated treatment of the first user data yields a good processing result and facilitates the subsequent extraction of a plurality of feature data, with which the integrated prediction model can be trained.
And S12, generating a standard data vector of each user according to the standard data obtained by the standardization processing, and extracting the characteristics of the standard data vector of each user to obtain a plurality of characteristic data.
If there are M users and n first data fields, then each user has n first user data, each first data field has M first user data under it, and the M users have M × n first user data in total.
Since the plurality of first user data under each first data field have been standardized, the standard data of all first data fields belonging to the same user are spliced to obtain that user's standard data vector. The standard data vector of the ith user is denoted (Mi1, Mi2, …, Min), where Min represents the standard data under the nth first data field of the ith user.
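The splicing step can be illustrated as follows, with hypothetical standardized values for M = 3 users and n = 4 first data fields (field names and values are invented for the example):

```python
import numpy as np

# Hypothetical standardized data: standard[f][i] is the standardized
# value of user i under first data field f.
standard = {
    "gdp":      [0.2, 0.5, 0.9],
    "age":      [0.3, 0.4, 0.8],
    "tenure":   [0.1, 0.6, 0.7],
    "turnover": [0.4, 0.2, 0.5],
}
fields = list(standard.keys())

# Splice all fields of the same user into one standard data vector
# (Mi1, Mi2, ..., Min).
vectors = np.array([[standard[f][i] for f in fields] for i in range(3)])

print(vectors.shape)   # (3, 4): one n-dimensional vector per user
```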
In an optional embodiment, the extracting the features of the standard data vector of each user to obtain a plurality of feature data includes:
extracting a plurality of cross features based on each standard data vector using an attention factorizer;
extracting a plurality of high-order features based on each standard data vector by using a multilayer perceptron;
generating a combined feature according to the plurality of cross features and the corresponding plurality of high-order features;
and carrying out normalization processing on the plurality of combined features to obtain a plurality of feature data.
Each standard data in the standard data vector of each user can be regarded as low-order feature data, a plurality of cross features extracted based on the standard data vector of each user are feature data obtained by crossing a plurality of low-order features, and a plurality of high-order features extracted based on the standard data vector of each user are feature data relative to the low-order features.
The Attentional Factorization Machine (AFM) introduces an attention mechanism into the feature-crossing module. It reduces the dimensionality of high-dimensional sparse features, representing them as low-dimensional dense features: each feature corresponds to a hidden vector, and the feature value is multiplied by the hidden vector to obtain the feature vector used as the actual feature representation. The AFM can effectively extract combined features from the dense features and can reflect that different pairs of combined features carry different weights.
The multilayer perceptron can be a multilayer Deep Neural Network (DNN), and the DNN can directly obtain a high-order representation of dense features.
The computer equipment extracts a plurality of cross features in the standard data vector of each user by using AFM and a plurality of high-order features in the standard data vector of each user by using DNN, and generates combined features of each user based on the plurality of cross features and the corresponding plurality of high-order features of each user.
Because the combined features of different users differ, and large differences between combined features can prevent the integrated prediction model from converging during training, each combined feature is normalized into feature data so that the integrated prediction model converges quickly in subsequent training. Each user corresponds to one feature data, and a plurality of users correspond to a plurality of feature data.
In this optional embodiment, the extracted combined features include cross combinations of the low-order features, with different cross features carrying different weights, as well as high-order features, so the combined features form a more comprehensive feature representation and avoid feature loss. Training the integrated prediction model on these combined features therefore improves its training precision and, in turn, its prediction accuracy. Normalizing the combined features also accelerates convergence when training the integrated prediction model, improving both its training efficiency and its prediction efficiency.
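A rough NumPy sketch of this extraction pipeline: pairwise cross features with attention weights (the AFM part), a single hidden layer standing in for the DNN, concatenation into a combined feature, and min-max normalization. All dimensions and random weights are illustrative assumptions, not the patent's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields, emb_dim = 4, 3

x = rng.normal(size=n_fields)             # one user's standard data vector
V = rng.normal(size=(n_fields, emb_dim))  # hidden (embedding) vector per field

# Actual feature representation: feature value times its hidden vector.
E = x[:, None] * V                        # (n_fields, emb_dim)

# Pairwise cross features: element-wise products of field embeddings.
pairs = [(i, j) for i in range(n_fields) for j in range(i + 1, n_fields)]
crosses = np.array([E[i] * E[j] for i, j in pairs])   # (6, emb_dim)

# Attention over the crossed pairs: each cross feature gets its own weight.
w = rng.normal(size=emb_dim)
scores = crosses @ w
att = np.exp(scores) / np.exp(scores).sum()           # softmax weights
afm_out = (att[:, None] * crosses).sum(axis=0)        # weighted cross feature

# High-order features from a small MLP (one hidden layer as a DNN stand-in).
W1 = rng.normal(size=(n_fields * emb_dim, 8))
h = np.maximum(E.reshape(-1) @ W1, 0.0)               # ReLU hidden layer

# Combined feature: concatenate cross and high-order parts, then normalize.
combined = np.concatenate([afm_out, h])
feat = (combined - combined.min()) / (combined.max() - combined.min())
print(feat.shape)                                     # (11,)
```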
And S13, taking the plurality of feature data as training data, and taking a plurality of second user data under corresponding second data fields as training targets, and training the integrated prediction model.
The second data field may be a performance field, such as the total policy sales amount, the total commission accrued by an agent, and the like.
And taking the plurality of feature data as training data of the integrated network model, taking the plurality of second user data under the second data field as training labels of the integrated network model, and performing supervised learning and training on the integrated network to obtain an integrated prediction model.
In an optional embodiment, the training of the integrated predictive model with the plurality of feature data as training data and the plurality of second user data under the corresponding second data fields as training targets includes:
inputting the plurality of feature data into a pre-training model, and extracting a feature vector of each feature data through the pre-training model;
inputting a plurality of feature vectors into a Bi-LSTM model for training;
acquiring a plurality of prediction data output by the Bi-LSTM model;
calculating residuals between a plurality of second user data under the second data field and the plurality of prediction data;
and training the Bi-LSTM model based on the residual error by adopting a forward feedback mechanism to obtain an integrated prediction model.
The integrated network model comprises a network architecture formed by cascading a pre-training model and a Bi-LSTM model, wherein the output of a CLS layer of the pre-training model is used as the input of the Bi-LSTM model.
BERT (Bidirectional Encoder Representations from Transformers) is a self-encoding pre-trained language model with bidirectional text feature representation: when processing a word, it can take into account the words both before and after it, thereby capturing the semantics of the context.
The BERT model may be pre-trained on the first user data, taking the user as the unit, so that it can process a user's feature data as if processing natural language and thereby extract feature vectors. Each feature data is fed into the trained BERT model, the vector at the CLS position of the BERT model is taken as the feature vector of that feature data, the plurality of feature vectors are input into the Bi-LSTM model, and a plurality of prediction data are output through the softmax layer of the Bi-LSTM model.
After each iterative training, calculating residual errors between the plurality of predicted data and the plurality of second user data under the second data field, judging whether the residual errors are smaller than a preset residual error threshold value, stopping training the integrated network model when the residual errors are smaller than the preset residual error threshold value, and taking the integrated network model after training as the integrated predicted model; and when the residual error is not less than the preset residual error threshold value, feeding the residual error back to the Bi-LSTM model by adopting a forward feedback mechanism, optimizing the parameters of the Bi-LSTM model, and performing the next round of training until the residual error is less than the preset residual error threshold value, and stopping the training of the integrated network model.
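The residual-threshold training loop can be sketched as follows. A plain linear model replaces the BERT/Bi-LSTM cascade purely to keep the example self-contained, so only the stopping and feedback logic mirrors the text; the data, learning rate, and threshold are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in training set: feature vectors in, second user data as targets.
X = rng.normal(size=(64, 5))                 # feature vectors
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w                               # second user data (targets)

w = np.zeros(5)                              # model parameters
residual_threshold = 1e-3                    # preset residual threshold
lr = 0.05

for step in range(10_000):
    pred = X @ w                             # prediction data from the model
    residual = np.mean((y - pred) ** 2)      # residual vs. second user data
    if residual < residual_threshold:        # stop once residual is small enough
        break
    # Feed the residual back and optimize the parameters (gradient step).
    w -= lr * (2 / len(y)) * X.T @ (pred - y)

print(residual < residual_threshold)         # True once training converges
```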
S14, obtaining a plurality of target characteristic data of the specified layer of the integrated prediction model, and clustering the plurality of target characteristic data to obtain a plurality of target characteristic data clusters.
When training of the integrated network model ends, the integrated network model is considered to have reached its optimal state, and the integrated network model in this optimal state is determined to be the integrated prediction model. At this point, the input of the softmax layer of the integrated prediction model is a plurality of target feature data, and the softmax layer predicts performance data from the target feature data so as to output a plurality of prediction data.
The specified layer is the softmax layer, and the target feature data of the softmax layer represent the feature representations that contribute to the performance data of the users.
The K-means clustering algorithm can be used for carrying out clustering analysis on the plurality of target characteristic data to obtain a plurality of target characteristic data clusters, and each target characteristic data cluster comprises a plurality of target characteristic data.
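As a rough illustration of this clustering step, the following is a minimal one-dimensional K-means sketch using only the standard library; in practice a library implementation (e.g. scikit-learn's KMeans) would be applied to the real, higher-dimensional target feature data, and the sample values and initial centers here are hypothetical.

```python
# Minimal 1-D K-means sketch (stdlib only); values and centers are illustrative.
def kmeans_1d(values, centers, rounds=20):
    """Cluster scalar target feature data around len(centers) centers."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:                          # assignment step: nearest center
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # update step: new cluster means
    return clusters, centers
```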
In this embodiment, by obtaining a plurality of target feature data of the designated layer of the integrated prediction model and clustering the plurality of target feature data, accurate clustering of a plurality of users is achieved, users with the same performance are grouped into the same class, and users with different performances are grouped into different classes.
S15, determining a second user data cluster corresponding to each target characteristic data cluster, and calculating a plurality of data binning thresholds according to the intersection of the second user data clusters.
And determining a corresponding second user data cluster according to the target characteristic data in the target characteristic data cluster, so that clustering of a plurality of second user data is realized, and a plurality of data binning thresholds are determined according to the clustered second user data clusters.
In an optional embodiment, the determining a second user data cluster corresponding to each target feature data cluster, and calculating a plurality of data binning thresholds according to an intersection of the plurality of second user data clusters includes:
determining second user data corresponding to the target characteristic data in each target characteristic data cluster to obtain a corresponding second user data cluster;
calculating central data in each second user data cluster, and sequencing a plurality of second user data clusters according to the central data;
combining every two adjacent second user data clusters in the sorted second user data clusters to obtain a second user data cluster pair;
and calculating a plurality of data binning thresholds according to a plurality of second user data cluster pairs.
Different target characteristic data correspond to different users, and the corresponding user is determined according to the target characteristic data, so that corresponding second user data is determined according to the user.
The central data refers to the average value of all second user data in each second user data cluster. The central data best represents where the majority of the second user data in the corresponding cluster lie: if the central data is large, most of the second user data in the corresponding second user data cluster are large; if the central data is small, most of them are small. The plurality of second user data clusters can therefore be sorted quickly and effectively according to the central data.
After the second user data clusters are sequentially or reversely ordered, only two adjacent second user data clusters may have an intersection, and only every two adjacent second user data clusters are combined, so that a data binning threshold can be determined, and the efficiency of determining the data binning threshold is high.
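The center-based sorting and adjacent-pairing step can be sketched as follows (ascending order is assumed; the cluster values are hypothetical):

```python
# Sketch: sort second user data clusters by their central data (cluster mean)
# and combine every two adjacent clusters into a pair.
def pair_sorted_clusters(clusters):
    """Return adjacent pairs of clusters sorted ascending by central data."""
    ordered = sorted(clusters, key=lambda c: sum(c) / len(c))
    return [(ordered[i], ordered[i + 1]) for i in range(len(ordered) - 1)]
```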
In an alternative embodiment, said calculating a plurality of data binning thresholds from a plurality of said second user data cluster pairs comprises:
when a data intersection exists between two second user data clusters in a second user data cluster pair, determining a left data intersection point and a right data intersection point in the data intersection; calculating a data binning threshold for the second user data cluster pair from the left data intersection and the right data intersection;
and when the two second user data clusters in the second user data cluster pair do not have data intersection, calculating the data binning threshold of the second user data cluster pair according to the central data in the two second user data clusters.
Taking ascending ordering of the plurality of second user data clusters as an example, when the two second user data clusters in a second user data cluster pair have a data intersection, the left data intersection point is determined as the smallest second user data in the later-ranked cluster, and the right data intersection point as the largest second user data in the earlier-ranked cluster; the mean of these two values is then calculated to obtain the first data binning threshold of the second user data cluster pair. For example, assuming the two second user data clusters in a pair are (70, 71, 74, 76, 77) and (75, 78, 79, 80), a data intersection exists between them: the left data intersection point is 75, the right data intersection point is 77, and the first data binning threshold of the pair, calculated from the left and right data intersection points, is 76.
Taking ascending ordering of the plurality of second user data clusters as an example, when the two second user data clusters in a second user data cluster pair have no data intersection, the first central data is determined as the central data of the earlier-ranked cluster and the second central data as the central data of the later-ranked cluster; the mean of the two central data is then calculated to obtain the second data binning threshold of the second user data cluster pair. For example, assuming the two second user data clusters in a pair are (80, 81, 85) and (88, 89, 91, 92), no data intersection exists between them: the first central data is 82, the second central data is 90, and the second data binning threshold of the pair, calculated from the two central data, is 86.
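The two per-pair threshold rules can be sketched together, using the worked examples above; the clusters in a pair are assumed to already be sorted in ascending order of central data.

```python
# Sketch of the per-pair data binning threshold rule described above.
def binning_threshold(lower, upper):
    """Threshold between an adjacent pair of second user data clusters."""
    right = max(lower)   # largest value in the earlier-ranked cluster
    left = min(upper)    # smallest value in the later-ranked cluster
    if left <= right:    # a data intersection exists
        return (left + right) / 2
    # no intersection: use the mean of the two central data (cluster means)
    return (sum(lower) / len(lower) + sum(upper) / len(upper)) / 2
```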
In an alternative embodiment, when the two second user data clusters in a second user data cluster pair have no data intersection, the first central data may instead be determined as the largest second user data in the earlier-ranked cluster and the second central data as the smallest second user data in the later-ranked cluster, and the mean of these two values is calculated to obtain the second data binning threshold of the second user data cluster pair.
It should be noted that, the method of the present invention may determine the user data binning threshold in units of users, and may also determine the department data binning threshold in units of departments.
In an optional embodiment, the method further comprises:
acquiring a plurality of target characteristic data of a user to be detected according to the first data field;
predicting by using the integrated prediction model based on the target feature data to obtain prediction data of the user to be tested under the second data field;
comparing the predicted data under the second data field of the user to be tested with the plurality of data binning thresholds;
and determining the grade of the user to be tested according to the comparison result obtained by the comparison.
After the integrated prediction model is trained, the grade prediction can be carried out on the user to be tested; after the multiple data binning thresholds are determined, assessment can be performed according to the predicted data, and therefore the level of the user is determined.
First, the standard data vector of the user to be tested is extracted; a plurality of cross features are then extracted from the standard data vector of the user to be tested by using an attentional factorization machine, and a plurality of high-order features are extracted from the standard data vector by using a multilayer perceptron. A combined feature is generated from the cross features and the corresponding high-order features and input into the integrated prediction model for prediction; the prediction data obtained is compared with each data binning threshold to determine a target data binning threshold, and the grade of the user to be tested is determined according to the target grade corresponding to the target data binning threshold.
For example, assume the multiple data binning thresholds are T1, T2, T3 and T4, where T1 < T2 < T3 < T4, and T1 corresponds to the "poor" grade, T2 to "medium", T3 to "good" and T4 to "excellent". If the prediction data of the user to be tested under the second data field is greater than T2 but smaller than T3, the target data binning threshold of the user to be tested is determined to be T2, and the "medium" target grade corresponding to the target data binning threshold T2 is the grade of the user to be tested.
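The comparison step can be sketched as follows; the concrete threshold values and grade labels are illustrative, following the T1–T4 example above.

```python
import bisect

# Illustrative thresholds and grades (T1 < T2 < T3 < T4, as in the example).
THRESHOLDS = [60.0, 70.0, 80.0, 90.0]
GRADES = ["poor", "medium", "good", "excellent"]

def grade_of(prediction):
    """The largest threshold not exceeding the prediction decides the grade."""
    idx = bisect.bisect_right(THRESHOLDS, prediction) - 1
    return GRADES[max(idx, 0)]   # below T1 falls back to "poor" (an assumption)
```

A prediction between T2 and T3 thus maps to the "medium" grade, matching the worked example.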
In summary, the data binning threshold calculation method of the present invention applies different normalization processes according to the data distribution of the first user data under different first data fields, and generates a standard data vector from the standard data obtained by the normalization. Feature extraction is performed on the standard data vector of each user to obtain a plurality of feature data, and the integrated prediction model is trained with the plurality of feature data as training data and the plurality of second user data under the corresponding second data fields as training targets. A plurality of target feature data are then obtained from a specified layer of the integrated prediction model, the plurality of second user data under the second data fields are clustered according to the clusters of the target feature data, and finally the plurality of data binning thresholds are calculated according to the intersections of the resulting second user data clusters. The method combines the first user data of the users and quickly determines the optimal data binning thresholds from the plurality of second user data. Even if the second user data change over time, the data binning thresholds can be updated quickly and in real time simply by iteratively updating the integrated prediction model, so that the data binning thresholds remain the optimal choice under objective conditions.
It is emphasized that the integrated prediction model may be stored in a node of the blockchain in order to further ensure privacy and security of the integrated prediction model.
Fig. 2 is a structural diagram of a data binning threshold calculation apparatus according to a second embodiment of the present invention.
In some embodiments, the data binning threshold calculation device 20 may include a plurality of functional modules composed of computer program segments. The computer programs of the respective program segments in the data binning threshold calculation means 20 may be stored in a memory of a computer device and executed by at least one processor to perform the functions of the data binning threshold calculation (described in detail in fig. 1).
In this embodiment, the data binning threshold calculation device 20 may be divided into a plurality of functional modules according to the functions performed by the device. The functional module may include: the system comprises a processing module 201, an extraction module 202, a training module 203, a clustering module 204, a calculation module 205 and a grading module 206. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The processing module 201 is configured to determine data distribution of the plurality of first user data in each first data field, and perform normalization processing on the plurality of first user data in the corresponding first data field according to the data distribution.
Each user corresponds to a plurality of first data fields, which may include, but are not limited to: local economic data, basic data, business data, etc. The local economic data may include GDP, population size, etc.; the basic data may include educational background, years of service, etc.; and the business data may include average daily turnover, average daily headcount growth, etc. The first data fields above are merely examples and do not limit the present invention in any way; the first data fields, and thus the first user data, may be determined according to the actual application scenario.
The plurality of first user data under different first data fields have different data distributions; standardizing the first user data under the corresponding first data field according to its data distribution enables targeted processing of the first user data under different first data fields.
In an optional embodiment, the processing module 201, normalizing the first user data under the corresponding first data field according to the data distribution includes:
when the data distribution of first user data under a first data field is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data field;
and when the data distribution of the first user data in the first data field is continuous data distribution, performing power operation on the first user data by using a preset power function to obtain a plurality of standard data in the first data field.
Some of the first user data under the first data fields may be discrete (e.g., educational background, age), and some may be continuous (e.g., average daily turnover, average daily headcount growth), so the first user data need to be standardized according to their data distribution.
In specific implementation, the computer device first obtains first user data of a plurality of users belonging to the same first data field, then determines data distribution of the plurality of first user data in each first data field, and adopts different standardized processing modes for the plurality of first user data in the first data field according to the data distribution.
In order to solve the problem of missing discrete data, when the data distribution of the first user data corresponding to a certain first data field is a discrete data distribution, function fitting is performed on the existing first user data, the missing data is then fitted using the fitting function, and the fitted missing data is used to fill the corresponding position, ensuring the integrity of the first user data. For example, for the first data field of age, suppose the age data of 10 users are 21, 22, 23, 24, 26, 27, 28, 29 and 30, indicating that the age data of the 5th user is missing. Function fitting is performed on the 9 age data points (1, 21), (2, 22), (3, 23), (4, 24), (6, 26), (7, 27), (8, 28), (9, 29), (10, 30), and the independent variable 5 is then substituted into the fitted function to obtain the missing data 25.
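The fitting step can be sketched with a simple least-squares line over the (position, value) pairs from the age example; the choice of a linear fitting function is an assumption, since the method does not fix the form of the fitting function.

```python
# Sketch: fill a missing discrete value by least-squares line fitting.
def fit_line(points):
    """Least-squares fit y = a*x + b over (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# The 9 known (user index, age) pairs from the example; user 5 is missing.
points = [(1, 21), (2, 22), (3, 23), (4, 24),
          (6, 26), (7, 27), (8, 28), (9, 29), (10, 30)]
a, b = fit_line(points)
missing_age = a * 5 + b   # substitute the independent variable 5
```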
When the data distribution of the first user data corresponding to a certain first data field is a continuous data distribution, a power function, for example, a log function, may be used to perform a power operation on each first user data in the first data field, so as to achieve the purpose of discretizing the continuous data.
In other embodiments, when the data distribution of the first user data corresponding to a certain first data field is continuous data distribution, the binning operation may be performed on the first user data in the first data field, so as to achieve the purpose of discretizing the continuous data. Wherein the binning operation may include chi-square binning, equidistant binning, equal frequency binning, and the like.
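As an illustration of the binning operations mentioned above, the following sketches equidistant and equal-frequency binning (chi-square binning is omitted for brevity); bin counts and values are hypothetical.

```python
# Sketch: discretize continuous first user data via binning.
def equidistant_bins(values, k):
    """Assign each value to one of k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0                 # guard against constant data
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)        # rank decides the bin
    return bins
```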
In this optional embodiment, the data distribution of the plurality of first user data in different first data fields is different, different standardization processing is performed on the plurality of first user data in the corresponding first data field according to the type of the data distribution, so that the differential processing on the different first user data is realized, the processing effect of the first user data is good, a plurality of feature data can be conveniently extracted subsequently, and therefore, the integrated prediction model can be trained.
The extracting module 202 is configured to generate a standard data vector of each user according to the standard data obtained through the normalization processing, and perform feature extraction on the standard data vector of each user to obtain a plurality of feature data.
If there are M users and n first data fields, then each user has n first user data, each first data field has M first user data, and the M users have M × n first user data in total.
Since the plurality of first user data under each first data field have been standardized, the standard data under all the first data fields belonging to the same user are concatenated to obtain the standard data vector of that user. The standard data vector of the ith user is denoted as (Mi1, Mi2, …, Min), where Min represents the standard data under the nth first data field of the ith user.
In an optional embodiment, the extracting module 202 performs feature extraction on the standard data vector of each user, and obtaining a plurality of feature data includes:
extracting a plurality of cross features based on each standard data vector using an attentional factorization machine;
extracting a plurality of high-order features based on each standard data vector by using a multilayer perceptron;
generating a combined feature according to the plurality of cross features and the corresponding plurality of high-order features;
and carrying out normalization processing on the plurality of combined features to obtain a plurality of feature data.
Each standard data in the standard data vector of each user can be regarded as low-order feature data; the plurality of cross features extracted from the standard data vector of each user are feature data obtained by crossing a plurality of low-order features, and the plurality of high-order features extracted from the standard data vector are higher-order feature data relative to those low-order features.
The attentional factorization machine (AFM) introduces an attention mechanism into the feature-crossing module. It reduces the dimension of high-dimensional sparse features, represents them as low-dimensional dense features, maps each feature to a latent vector, and multiplies the feature value by the latent vector to obtain a feature vector as the actual feature representation. The attentional factorization machine can effectively extract cross features from the dense features and can reflect that different cross features carry different weights.
The multilayer perceptron can be a multilayer Deep Neural Network (DNN), and the DNN can directly obtain a high-order representation of dense features.
The computer equipment extracts a plurality of cross features in the standard data vector of each user by using AFM and a plurality of high-order features in the standard data vector of each user by using DNN, and generates combined features of each user based on the plurality of cross features and the corresponding plurality of high-order features of each user.
Because the combined features of different users differ, and when the differences between combined features are large the integrated prediction model cannot converge during training on them, each combined feature is normalized to obtain feature data, so that the integrated prediction model converges quickly in subsequent training. Each user corresponds to one feature data, and the plurality of users correspond to a plurality of feature data.
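The normalization step can be sketched as follows; min-max scaling to [0, 1] is assumed here, since the method does not specify the normalization formula.

```python
# Sketch: min-max normalization of combined features (assumed formula).
def min_max_normalize(features):
    """Scale a list of combined-feature values into [0, 1]."""
    lo, hi = min(features), max(features)
    span = (hi - lo) or 1.0          # guard against constant features
    return [(f - lo) / span for f in features]
```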
In this optional embodiment, the extracted combined features include cross combinations of the low-order features, with different cross features carrying different weights, as well as the high-order features, so the combined features give a more comprehensive feature representation and avoid the loss of features. Training the integrated prediction model on these combined features can therefore improve its training precision and thus its prediction accuracy. Moreover, normalizing the plurality of combined features speeds up the convergence of training, improving both the training efficiency and the prediction efficiency of the integrated prediction model.
The training module 203 is configured to train the integrated prediction model by using the plurality of feature data as training data and using a plurality of second user data under corresponding second data fields as training targets.
Wherein the second data field may be a performance field, such as the total amount of policy sales, the total amount of agent accruals, etc.
The plurality of feature data are taken as training data of the integrated network model, the plurality of second user data under the second data fields of the plurality of users are taken as training labels of the integrated network model, and supervised learning and training are performed on the integrated network model to obtain the integrated prediction model.
In an alternative embodiment, the training module 203 takes the plurality of feature data as training data, and takes a plurality of second user data under corresponding second data fields as training targets, and training the integrated predictive model includes:
inputting the plurality of feature data into a pre-training model, and extracting a feature vector of each feature data through the pre-training model;
inputting a plurality of feature vectors into a Bi-LSTM model for training;
acquiring a plurality of prediction data output by the Bi-LSTM model;
calculating residuals between a plurality of second user data under the second data field and the plurality of prediction data;
and training the Bi-LSTM model based on the residual error by adopting a forward feedback mechanism to obtain an integrated prediction model.
The integrated network model comprises a network architecture formed by cascading a pre-training model and a Bi-LSTM model, wherein the output of a CLS layer of the pre-training model is used as the input of the Bi-LSTM model.
BERT (Bidirectional Encoder Representations from Transformers) is a self-encoding pre-trained language model with bidirectional text feature representation; when processing a word, it can take the words before and after that word into account, and can therefore capture the semantics of the context.
The BERT model may be pre-trained on the first user data, taking each user as a unit, so that the BERT model can process a user's feature data in the same way as it processes natural language, thereby extracting feature vectors. Each feature data is taken as the input of the trained BERT model, the vector at the CLS position in the BERT model is then obtained as the feature vector of that feature data, the plurality of feature vectors are input into the Bi-LSTM model, and a plurality of prediction data are output through a softmax layer of the Bi-LSTM model.
After each round of iterative training, residuals between the plurality of prediction data and the plurality of second user data under the second data field are calculated, and it is judged whether the residual is smaller than a preset residual threshold. When the residual is smaller than the preset residual threshold, training of the integrated network model is stopped, and the trained integrated network model is taken as the integrated prediction model. When the residual is not smaller than the preset residual threshold, the residual is fed back to the Bi-LSTM model through a feedback mechanism to optimize the parameters of the Bi-LSTM model, and the next round of training is performed, until the residual is smaller than the preset residual threshold and training of the integrated network model stops.
The clustering module 204 is configured to obtain a plurality of target feature data of a specified layer of the integrated prediction model, and cluster the plurality of target feature data to obtain a plurality of target feature data clusters.
When training of the integrated network model ends, the integrated network model is considered to have reached its optimal state, and the integrated network model in this optimal state is determined to be the integrated prediction model. At this point, the input of the softmax layer of the integrated prediction model is a plurality of target feature data, and the softmax layer predicts performance data from the target feature data so as to output a plurality of prediction data.
The specified layer is the softmax layer, and the target feature data of the softmax layer represent the feature representations that contribute to the performance data of the users.
The K-means clustering algorithm can be used for carrying out clustering analysis on the plurality of target characteristic data to obtain a plurality of target characteristic data clusters, and each target characteristic data cluster comprises a plurality of target characteristic data.
In this embodiment, the clustering of the plurality of second user data is realized by acquiring the plurality of target feature data of the designated layer of the integrated prediction model and clustering the plurality of target feature data, users with the same performance are grouped into the same class, and users with different performances are grouped into different classes.
The calculating module 205 is configured to determine a second user data cluster corresponding to each target feature data cluster, and calculate a plurality of data binning thresholds according to an intersection of the plurality of second user data clusters.
And determining a corresponding second user data cluster according to the target characteristic data in the target characteristic data cluster, so that clustering of a plurality of second user data is realized, and a plurality of data binning thresholds are determined according to the clustered second user data clusters.
In an optional embodiment, the determining, by the computing module 205, a second user data cluster corresponding to each target feature data cluster, and computing a plurality of data binning thresholds according to an intersection of the plurality of second user data clusters includes:
determining second user data corresponding to the target characteristic data in each target characteristic data cluster to obtain a corresponding second user data cluster;
calculating central data in each second user data cluster, and sequencing a plurality of second user data clusters according to the central data;
combining every two adjacent second user data clusters in the sorted second user data clusters to obtain a second user data cluster pair;
and calculating a plurality of data binning thresholds according to a plurality of second user data cluster pairs.
Different target characteristic data correspond to different users, and the corresponding user is determined according to the target characteristic data, so that corresponding second user data is determined according to the user.
The central data refers to the average value of all second user data in each second user data cluster. The central data best represents where the majority of the second user data in the corresponding cluster lie: if the central data is large, most of the second user data in the corresponding second user data cluster are large; if the central data is small, most of them are small. The plurality of second user data clusters can therefore be sorted quickly and effectively according to the central data.
After the second user data clusters are sequentially or reversely ordered, only two adjacent second user data clusters may have an intersection, and only every two adjacent second user data clusters are combined, so that a data binning threshold can be determined, and the efficiency of determining the data binning threshold is high.
In an alternative embodiment, said calculating a plurality of data binning thresholds from a plurality of said second user data cluster pairs comprises:
when a data intersection exists between two second user data clusters in a second user data cluster pair, determining a left data intersection point and a right data intersection point in the data intersection; calculating a data binning threshold for the second user data cluster pair from the left data intersection and the right data intersection;
and when the two second user data clusters in the second user data cluster pair do not have data intersection, calculating the data binning threshold of the second user data cluster pair according to the central data in the two second user data clusters.
Taking ascending ordering of the plurality of second user data clusters as an example, when the two second user data clusters in a second user data cluster pair have a data intersection, the left data intersection point is determined as the smallest second user data in the later-ranked cluster, and the right data intersection point as the largest second user data in the earlier-ranked cluster; the mean of these two values is then calculated to obtain the first data binning threshold of the second user data cluster pair. For example, assuming the two second user data clusters in a pair are (70, 71, 74, 76, 77) and (75, 78, 79, 80), a data intersection exists between them: the left data intersection point is 75, the right data intersection point is 77, and the first data binning threshold of the pair, calculated from the left and right data intersection points, is 76.
Again taking ascending ordering of the plurality of second user data clusters as an example: when there is no data intersection between the two second user data clusters in a pair, the first central data is determined as the central data of the earlier-ordered cluster, the second central data as the central data of the later-ordered cluster, and the mean of the two central data is calculated to obtain the second data binning threshold of the second user data cluster pair. For example, assuming the two second user data clusters in a pair are (80, 81, 85) and (88, 89, 91, 92), there is no data intersection between them: the first central data is 82, the second central data is 90, and the second data binning threshold of the pair, calculated from the first and second central data, is 86.
In an alternative embodiment, when there is no data intersection between the two second user data clusters in a pair, the first central data may instead be determined as the largest second user data in the earlier-ordered cluster and the second central data as the smallest second user data in the later-ordered cluster; the mean of the two is then calculated to obtain the second data binning threshold of the second user data cluster pair.
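The two cases above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and the assumption that both clusters are numeric lists are illustrative, and the "central data" is taken to be the cluster mean as in the worked example.

```python
def binning_threshold(cluster_a, cluster_b):
    """Compute a binning threshold for an adjacent pair of clusters,
    with cluster_a ordered before cluster_b (ascending order).
    Hypothetical helper illustrating the two cases described above."""
    a, b = sorted(cluster_a), sorted(cluster_b)
    left, right = min(b), max(a)  # left and right data intersection points
    if left <= right:
        # Case 1: the clusters overlap -> mean of the two intersection points
        return (left + right) / 2
    # Case 2: no overlap -> mean of the two cluster centers
    center_a = sum(a) / len(a)
    center_b = sum(b) / len(b)
    return (center_a + center_b) / 2

print(binning_threshold([70, 71, 74, 76, 77], [75, 78, 79, 80]))  # 76.0
print(binning_threshold([80, 81, 85], [88, 89, 91, 92]))          # 86.0
```

Both calls reproduce the worked examples: the overlapping pair yields (75 + 77) / 2 = 76, and the disjoint pair yields (82 + 90) / 2 = 86.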
It should be noted that the apparatus of the present invention may determine the user data binning threshold in units of users, and may also determine the department data binning threshold in units of departments.
The grading module 206 is configured to obtain multiple target feature data of the user to be tested according to the first data field; predicting by using the integrated prediction model based on the target feature data to obtain prediction data of the user to be tested under the second data field; comparing the predicted data under the second data field of the user to be tested with the plurality of data binning thresholds; and determining the grade of the user to be tested according to the comparison result obtained by the comparison.
After the integrated prediction model is trained, grade prediction can be carried out for the user to be tested; after the multiple data binning thresholds are determined, the predicted data can be assessed against them to determine the grade of the user.
Specifically, a standard data vector of the user to be tested is first extracted. An attention factorization machine is then used to extract a plurality of cross features from the standard data vector, and a multilayer perceptron is used to extract a plurality of high-order features from it. Combined features are generated from the cross features and the corresponding high-order features, and the combined features are input into the integrated prediction model for prediction. The predicted data is compared with each data binning threshold to determine a target data binning threshold, and the grade of the user to be tested is determined according to the target grade corresponding to the target data binning threshold.
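The combination step alone can be sketched as below. This assumes the cross features and high-order features are already available as vectors (the attention factorization machine and multilayer perceptron themselves are not shown), and it uses concatenation followed by min-max normalization as one plausible reading of "generating combined features" and the normalization in claim 5; the patent does not specify the exact combination operator.

```python
import numpy as np

def combine_features(cross_features, high_order_features):
    """Concatenate cross features and high-order features into a single
    combined feature vector and min-max normalize it (illustrative
    assumption; the combination operator is not specified in the text)."""
    combined = np.concatenate([cross_features, high_order_features])
    span = combined.max() - combined.min()
    return (combined - combined.min()) / span if span else combined

cross = np.array([0.2, 0.8])       # e.g. from an attention factorization machine
high_order = np.array([1.4, 0.6])  # e.g. from a multilayer perceptron
print(combine_features(cross, high_order))
```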
For example, assume the multiple data binning thresholds are T1, T2, T3 and T4, where T1 < T2 < T3 < T4, and T1 corresponds to the "poor" grade, T2 to "medium", T3 to "good" and T4 to "excellent". If the predicted data of the user to be tested under the second data field is greater than T2 but less than T3, the target data binning threshold of the user is T2, and the "medium" grade corresponding to T2 is the grade of the user to be tested.
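The threshold-to-grade lookup in this example can be written as a short function. The numeric threshold values and the fallback to the lowest grade for predictions below T1 are illustrative assumptions, not from the patent.

```python
import bisect

def grade_user(predicted, thresholds, grades):
    """Return the grade whose threshold is the largest one not exceeding
    the predicted value; predictions below the first threshold fall back
    to the lowest grade (hypothetical reading of the example above)."""
    i = bisect.bisect_right(thresholds, predicted) - 1
    return grades[max(i, 0)]

thresholds = [60, 70, 80, 90]  # T1 < T2 < T3 < T4 (illustrative values)
grades = ["poor", "medium", "good", "excellent"]
print(grade_user(75, thresholds, grades))  # between T2 and T3 -> "medium"
```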
In summary, the data binning threshold calculation apparatus according to the present invention applies different normalization processes according to the data distribution of the first user data under different first data fields, generates a standard data vector from the resulting standard data, and performs feature extraction on each user's standard data vector to obtain a plurality of feature data. It trains the integrated prediction model using the plurality of feature data as training data and the plurality of second user data under the corresponding second data fields as training targets, obtains a plurality of target feature data from a specified layer of the integrated prediction model, clusters the second user data under the second data fields according to the clusters of target feature data, and finally calculates the plurality of data binning thresholds from the intersections of the resulting second user data clusters. The method combines the user's first user data to quickly determine optimal data binning thresholds from the plurality of second user data. Even as the second user data changes over time, the data binning thresholds can be updated quickly and in real time simply by iteratively updating the integrated prediction model, so that the thresholds remain the optimal choice under objective conditions.
It is emphasized that the integrated prediction model may be stored in a node of the blockchain in order to further ensure the privacy and security of the integrated prediction model.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not limit the embodiments of the present invention; it may be a bus-type or star-type configuration, and the computer device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.
In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 3 is only an example; other electronic products that are currently available or may become available in the future and that are adaptable to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the data binning threshold calculation method described above. The memory 31 includes read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some embodiments, the at least one processor 32 is the control unit of the computer device 3; it connects the various components of the entire computer device 3 by using various interfaces and lines, and executes the various functions of the computer device 3 and processes its data by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, when executing the computer program stored in the memory, the at least one processor 32 implements all or part of the steps of the data binning threshold calculation method described in the embodiments of the present invention, or all or part of the functions of the data binning threshold calculation apparatus. The at least one processor 32 may be composed of integrated circuits, for example a single packaged integrated circuit, or a plurality of integrated circuits with the same or different functions packaged together, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements, and the singular does not exclude the plural. A plurality of units or means recited in the present invention can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, and not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for calculating a data binning threshold, the method comprising:
determining data distribution of a plurality of first user data under each first data field, and performing standardization processing on the plurality of first user data under the corresponding first data field according to the data distribution;
generating a standard data vector of each user according to the standard data obtained by standardization, and performing feature extraction on the standard data vector of each user to obtain a plurality of feature data;
taking the plurality of characteristic data as training data, taking a plurality of second user data under corresponding second data fields as training targets, and training an integrated prediction model;
acquiring a plurality of target characteristic data of a specified layer of the integrated prediction model, and clustering the plurality of target characteristic data to obtain a plurality of target characteristic data clusters;
and determining a second user data cluster corresponding to each target characteristic data cluster, and calculating a plurality of data binning thresholds according to the intersection of the second user data clusters.
2. The method of claim 1, wherein determining a second user data cluster corresponding to each target feature data cluster, and calculating a plurality of data binning thresholds according to an intersection of the plurality of second user data clusters comprises:
determining second user data corresponding to the target characteristic data in each target characteristic data cluster to obtain a corresponding second user data cluster;
calculating central data in each second user data cluster, and sequencing a plurality of second user data clusters according to the central data;
combining every two adjacent second user data clusters in the sorted second user data clusters to obtain a second user data cluster pair;
and calculating a plurality of data binning thresholds according to a plurality of second user data cluster pairs.
3. The method of claim 2, wherein said computing a plurality of data binning thresholds from a plurality of said second user data cluster pairs comprises:
when a data intersection exists between two second user data clusters in a second user data cluster pair, determining a left data intersection point and a right data intersection point in the data intersection; calculating a data binning threshold for the second user data cluster pair from the left data intersection and the right data intersection;
and when the two second user data clusters in the second user data cluster pair do not have data intersection, calculating the data binning threshold of the second user data cluster pair according to the central data in the two second user data clusters.
4. The method of claim 3, wherein the normalizing the first user data under the corresponding first data field according to the data distribution comprises:
when the data distribution of first user data under a first data field is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data field;
and when the data distribution of the first user data in the first data field is continuous data distribution, performing power operation on the first user data by using a preset power function to obtain a plurality of standard data in the first data field.
5. The method of claim 3, wherein the feature extraction of the standard data vector of each user to obtain a plurality of feature data comprises:
extracting a plurality of cross features based on each standard data vector using an attention factorizer;
extracting a plurality of high-order features based on each standard data vector by using a multilayer perceptron;
generating a plurality of combined features according to the plurality of cross features and the corresponding plurality of high-order features;
and carrying out normalization processing on the plurality of combined features to obtain a plurality of feature data.
6. The method according to claim 4 or 5, wherein the training of the integrated predictive model with the plurality of feature data as training data and the plurality of second user data under the corresponding second data fields as training targets comprises:
inputting the plurality of feature data into a pre-training model, and extracting a feature vector of each feature data through the pre-training model;
inputting a plurality of feature vectors into a Bi-LSTM model for training;
acquiring a plurality of prediction data output by the Bi-LSTM model;
calculating residuals between a plurality of second user data under the second data field and the plurality of prediction data;
and training the Bi-LSTM model based on the residual error by adopting a forward feedback mechanism to obtain an integrated prediction model.
7. The method of claim 6, further comprising:
acquiring a plurality of target characteristic data of a user to be detected according to the first data field;
predicting by using the integrated prediction model based on the target feature data to obtain prediction data of the user to be tested under the second data field;
comparing the predicted data under the second data field of the user to be tested with the plurality of data binning thresholds;
and determining the grade of the user to be tested according to the comparison result obtained by the comparison.
8. An apparatus for calculating a data binning threshold, the apparatus comprising:
the processing module is used for determining the data distribution of the plurality of first user data under each first data field and carrying out standardization processing on the plurality of first user data under the corresponding first data field according to the data distribution;
the extraction module is used for generating a standard data vector of each user according to the standard data obtained by the standardization processing, and performing feature extraction on the standard data vector of each user to obtain a plurality of feature data;
the training module is used for training the integrated prediction model by taking the plurality of characteristic data as training data and taking a plurality of second user data under corresponding second data fields as training targets;
the clustering module is used for acquiring a plurality of target characteristic data of a specified layer of the integrated prediction model and clustering the plurality of target characteristic data to obtain a plurality of target characteristic data clusters;
and the calculation module is used for determining a second user data cluster corresponding to each target characteristic data cluster and calculating a plurality of data binning thresholds according to the intersection of the second user data clusters.
9. A computer device, characterized in that the computer device comprises a processor for implementing the data binning threshold calculation method according to any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a data binning threshold calculation method according to any one of claims 1 to 7.
CN202110036327.4A 2021-01-12 2021-01-12 Data binning threshold calculation method and device, computer equipment and storage medium Pending CN112819034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110036327.4A CN112819034A (en) 2021-01-12 2021-01-12 Data binning threshold calculation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112819034A (en) 2021-05-18

Family

ID=75868866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110036327.4A Pending CN112819034A (en) 2021-01-12 2021-01-12 Data binning threshold calculation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112819034A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871166A (en) * 2016-09-27 2018-04-03 第四范式(北京)技术有限公司 For the characteristic processing method and characteristics processing system of machine learning
CN108021984A (en) * 2016-11-01 2018-05-11 第四范式(北京)技术有限公司 Determine the method and system of the feature importance of machine learning sample
CN109543925A (en) * 2019-01-07 2019-03-29 平安科技(深圳)有限公司 Risk Forecast Method, device, computer equipment and storage medium based on machine learning
CN110009479A (en) * 2019-03-01 2019-07-12 百融金融信息服务股份有限公司 Credit assessment method and device, storage medium, computer equipment
CN110084376A (en) * 2019-04-30 2019-08-02 成都四方伟业软件股份有限公司 To the method and device of the automatic branch mailbox of data
US20190332950A1 (en) * 2018-04-27 2019-10-31 Tata Consultancy Services Limited Unified platform for domain adaptable human behaviour inference
CN110708285A (en) * 2019-08-30 2020-01-17 中国平安人寿保险股份有限公司 Flow monitoring method, device, medium and electronic equipment
US20200050896A1 (en) * 2018-08-09 2020-02-13 Servicenow, Inc. Machine Learning Classification with Model Quality Prediction
CN110909963A (en) * 2018-09-14 2020-03-24 中国软件与技术服务股份有限公司 Credit scoring card model training method and taxpayer abnormal risk assessment method
CN112100291A (en) * 2020-09-18 2020-12-18 中国建设银行股份有限公司 Data binning method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REN CAI et al.: "Sentiment Analysis about Investors and Consumers in Energy Market Based on BERT-BiLSTM", IEEE *
WANG Jin et al.: "Multi-target regression combining target-specific features and target correlations", Acta Electronica Sinica *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210518