CN113177613A - System resource data distribution method and device - Google Patents

System resource data distribution method and device

Info

Publication number: CN113177613A
Application number: CN202110570829.5A
Authority: CN (China)
Prior art keywords: feature, target, features, importance, classifier
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Other languages: Chinese (zh)
Inventors: 兰亭, 徐琳玲
Current and original assignee: Industrial and Commercial Bank of China Ltd (ICBC) (The listed assignees may be inaccurate; Google has not performed a legal analysis.)
Application filed by: Industrial and Commercial Bank of China Ltd (ICBC)
Priority application: CN202110570829.5A

Classifications

    • G06F18/241 (Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches)
    • G06F18/214 (Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting)
    • G06Q40/02 (Finance; banking, e.g. interest calculation or account maintenance)
    • G06Q40/06 (Finance; asset management; financial planning or analysis)

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Marketing (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The specification relates to the technical field of machine learning, and in particular discloses a system resource data distribution method and device. The method comprises the following steps: acquiring a feature data set, where the feature data set contains, for each of a plurality of users, a correspondence between that user's features and a label, and the label indicates whether the user is a user of a specified type; constructing a classifier from the feature data set and generating a full feature importance list, where the full feature importance list records the feature importance of every feature in the feature data set; determining a target input feature set based on the full feature importance list; and constructing a target classifier from the target input feature set, so that whether a target user is a user of the specified type can be determined by the target classifier, and system resource data can be distributed to the target user according to the target user's type. The method in the embodiments can improve the accuracy and efficiency of system resource data distribution, thereby improving the utilization rate of system resources.

Description

System resource data distribution method and device
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and an apparatus for allocating system resource data.
Background
With the rapid development of big data service platform technology, financial resource data services and the selectable service channels have become increasingly diverse and convenient, and determining user types has become increasingly important for financial institutions. For example, banks and similar enterprises hold huge volumes of business data, and when a machine learning model is constructed to determine user types, many features are often considered in order to improve the model's prediction performance. As a result, the finally generated wide structured feature table is large, reaching millions or even tens of millions of rows, and contains many kinds of features, so the model cannot run. In general, the operating pressure on the model can be reduced by screening features.
Currently, before model training, features are screened by calculating WOE (Weight of Evidence) and IV (Information Value) to determine feature importance. However, the feature importance calculated in this way is not accurate enough, so a model trained on the screened features predicts poorly on the test set. If the user type prediction is not accurate enough, problems such as unreasonable resource data allocation and poor user experience may arise.
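The WOE/IV screening described above can be sketched as follows. This is an illustrative implementation, not the one used in the embodiments; the quantile binning, the number of bins, and the 0.5 smoothing constant are assumptions.

```python
import numpy as np
import pandas as pd

def woe_iv(feature, label, n_bins=10):
    """Compute Weight of Evidence per bin and the feature's Information Value.

    feature: 1-D numeric array; label: 1-D array of 0/1 (1 = specified-type user).
    """
    df = pd.DataFrame({"x": feature, "y": label})
    # Quantile binning; `duplicates="drop"` tolerates repeated bin edges.
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
    events = grouped["sum"]                 # label-1 users per bin
    non_events = grouped["count"] - events  # label-0 users per bin
    # A small constant avoids division by zero / log of zero in sparse bins.
    pct_event = (events + 0.5) / (events.sum() + 0.5)
    pct_non = (non_events + 0.5) / (non_events.sum() + 0.5)
    woe = np.log(pct_event / pct_non)
    iv = ((pct_event - pct_non) * woe).sum()
    return woe, iv
```

An informative feature yields a much larger IV than pure noise, which is exactly the ranking this screening method relies on.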
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the specification provides a method and a device for allocating system resource data, which can improve the accuracy and the efficiency of allocating the system resource data.
An embodiment of the present specification provides a method for allocating system resource data, including: acquiring a feature data set, wherein the feature data set comprises corresponding relations between a plurality of features of each user in a plurality of users and tags, and the tags are used for representing whether the users are users of a specified type; constructing a classifier by using the feature data set, and generating a full feature importance list, wherein the full feature importance list comprises feature importance of each feature in a plurality of features in the feature data set; determining a target input feature set based on the full feature importance list; and constructing a target classifier by using the target input feature set so as to determine whether the target user is a user of a specified type based on the target classifier, and distributing system resource data to the target user according to the type of the target user.
An embodiment of the present specification further provides a system resource data allocation apparatus, including: the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a characteristic data set, the characteristic data set comprises the corresponding relation between a plurality of characteristics of each user in a plurality of users and a label, and the label is used for representing whether the user is a user of a specified type; the generating module is used for constructing a classifier by utilizing the feature data set and generating a full feature importance list, wherein the full feature importance list comprises feature importance of each feature in a plurality of features in the feature data set; the determining module is used for determining a target input feature set based on the full feature importance list; and the allocation module is used for constructing a target classifier by utilizing the target input feature set, determining whether the target user is a user of a specified type based on the target classifier, and allocating system resource data to the target user according to the type of the target user.
Embodiments of the present specification further provide a computer device, including a processor and a memory for storing processor-executable instructions, where the processor executes the instructions to implement the steps of the system resource data allocation method described in any of the above embodiments.
Embodiments of the present specification also provide a computer readable storage medium, on which computer instructions are stored, and when executed, the instructions implement the steps of the system resource data allocation method described in any of the above embodiments.
In an embodiment of the present specification, a method for allocating system resource data is provided. A server may obtain a feature data set, where the feature data set may contain, for each of a plurality of users, a correspondence between that user's features and a label used to characterize whether the user is a user of a specified type. The server may construct a classifier using the feature data set and generate a full feature importance list containing the feature importance of every feature in the feature data set. The server may then determine a target input feature set based on the full feature importance list and construct a target classifier using the target input feature set, so that whether a target user is a user of the specified type can be determined by the target classifier, and system resource data can be allocated to the target user according to the target user's type. In this scheme, the classifier is built from the feature data set to obtain the feature importance of each feature, and feature screening is carried out based on the feature importance obtained in the process of building the classifier. Meanwhile, feature screening reduces the number of feature types, helping the model run smoothly with a large data volume. In addition, whether the target user is a user of the specified type is determined by the trained classifier, so system resource data can be distributed based on the user's type, which improves the accuracy and efficiency of system resource data allocation and thereby the utilization rate of system resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, are incorporated in and constitute a part of this specification, and are not intended to limit the specification. In the drawings:
FIG. 1 is a flow diagram illustrating a method for system resource data allocation in one embodiment of the present description;
FIG. 2 is a flow diagram illustrating a method for system resource data allocation in one embodiment of the present description;
FIG. 3 is a schematic diagram of a system resource data allocation apparatus in one embodiment of the present specification;
FIG. 4 shows a schematic diagram of a computer device in one embodiment of the present description.
Detailed Description
The principles and spirit of the present description will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely to enable those skilled in the art to better understand and to implement the present description, and are not intended to limit the scope of the present description in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present description may be embodied as a system, an apparatus, a method, or a computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
The embodiments of the specification provide a system resource data distribution method. In one example scenario, the method may be applied to a server. The server may obtain the feature data set, which may be stored locally on the server or on a database server. The feature data set may contain, for each of a plurality of users, a correspondence between that user's features and a label characterizing whether the user is a user of a specified type. The features are index data related to the user type, generated according to business logic. The server may construct a classifier using the feature data set and generate a full feature importance list, which may contain the feature importance of each feature in the feature data set. The server may then determine a target input feature set from the full feature importance list; that is, the feature set used to train the target classifier is screened out based on the full feature importance list. The server may then build the target classifier using the target input feature set. After receiving a target user's features from a terminal device and inputting them into the target classifier, the server can determine whether the target user is a user of the specified type and allocate system resource data to the target user according to the target user's type.
Fig. 1 shows a flowchart of a system resource data allocation method in an embodiment of the present specification. Although the present specification provides method operational steps or apparatus configurations as illustrated in the following examples or figures, more or fewer operational steps or modular units may be included in the methods or apparatus based on conventional or non-inventive efforts. In the case of steps or structures which do not logically have the necessary cause and effect relationship, the execution sequence of the steps or the module structure of the apparatus is not limited to the execution sequence or the module structure described in the embodiments and shown in the drawings. When the described method or module structure is applied in an actual device or end product, the method or module structure according to the embodiments or shown in the drawings can be executed sequentially or executed in parallel (for example, in a parallel processor or multi-thread processing environment, or even in a distributed processing environment).
Specifically, as shown in fig. 1, a method for allocating system resource data provided by an embodiment of the present specification may include the following steps:
step S10, a feature data set is obtained, where the feature data set includes a correspondence between a plurality of features of each of a plurality of users and a label, and the label is used to characterize whether the user is a user of a specified type.
The method in this embodiment may be applied to a resource allocation server. The resource allocation server may obtain the feature data set. The feature data set may include a correspondence between a plurality of features and tags for each of a plurality of users. The characteristics may be metric data relating to the user type generated according to the business logic. The tags may be used to characterize whether the user is a specified type of user. The specified type can be set according to the service requirement. For example, the specified type may be a user with a higher user risk. As another example, the specified type may be a targeted customer for which marketing is intended. As another example, the specified type may be a VIP user, or the like.
In one embodiment, the feature data set may be stored on a database server. The database server may perform feature engineering on data of a plurality of users of known types to obtain a plurality of features for each of these users, thereby generating the feature data set. The resource allocation server may retrieve the feature data set from the database server. For example, the resource allocation server may send a fetch request to the database server, which sends the feature data set to the resource allocation server in response. As another example, the database server may send the feature data set to the resource allocation server automatically at scheduled times.
Step S12, constructing a classifier using the feature data set, and generating a full feature importance list, where the full feature importance list includes feature importance of each feature in the plurality of features in the feature data set.
After receiving the feature data set, the resource allocation server may construct a classifier using the feature data set and generate a full feature importance list. Specifically, after receiving the feature data set, the resource allocation server may perform model training using the feature data set to obtain a classifier. In the process of model training, the feature importance of each of the plurality of features can be obtained.
Feature importance may be used to characterize how useful a feature is in predicting the target variable; it is the degree of importance to model prediction that the algorithm calculates for each feature after model training. Feature importance values lie in the range [0, 1], endpoints included, and a larger value indicates a more important feature. For example, when modeling with a tree model, feature importance can generally be returned through the feature_importances_ attribute. As another example, after the model is trained, feature importance can be quantified on the data: the trained model first scores the data to compute a baseline evaluation index; then, for each feature in the data, the values of the current feature are randomly shuffled, the data are re-scored, the evaluation index is recomputed, and the rate of change of the index is calculated. A rate of change is thus obtained for each feature, and feature importance is quantified by ranking these rates of change.
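The permutation-based quantification just described can be sketched as follows. This is an illustrative sketch, assuming scikit-learn; the accuracy metric and the helper's name are assumptions, not part of the embodiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def permutation_importance_simple(model, X, y, seed=0):
    """Rank features by how much randomly re-taking each column degrades the score."""
    rng = np.random.default_rng(seed)
    base = accuracy_score(y, model.predict(X))           # baseline evaluation index
    rates = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])     # randomly take the current feature's values
        score = accuracy_score(y, model.predict(X_perm))
        rates.append((base - score) / max(base, 1e-12))  # index change rate
    order = np.argsort(rates)[::-1]                      # descending: most important first
    return order, rates
```

Shuffling an important feature sharply degrades the evaluation index, giving a large rate of change; shuffling an irrelevant feature barely moves it.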
In step S14, a target input feature set is determined based on the full feature importance list.
After generating the full feature importance list, a target input feature set may be determined. Specifically, the set of features in the full feature importance list whose feature importance satisfies a preset condition may be determined as the target input feature set. The preset condition can be set by service personnel according to actual requirements. For example, if the features in the full feature importance list are arranged in descending order of feature importance, the set of the top preset number of features in the list may be determined as the target input feature set.
And step S16, constructing a target classifier by using the target input feature set, determining whether the target user is a user of a specified type based on the target classifier, and distributing system resource data to the target user according to the type of the target user.
After the target input feature set is obtained, a target classifier can be constructed using it. That is, a target training set may be generated based on the target input feature set and the feature data set, and model training may then be carried out on the target training set to obtain the target classifier. After obtaining the target classifier, the resource allocation server may receive a plurality of features of a target user sent by a terminal device, input them into the target classifier to determine whether the target user is a user of the specified type, and then allocate system resource data to the target user according to the target user's type. For example, system resource data is allocated to the target user if the target user is a user of the specified type, and not allocated otherwise. The system resource data may include, for example, services, products, or hardware resources provided or recommended to the user. In a loan transaction scenario, the resource data may be the loan amount and/or loan type allocated to the target user; in a cloud platform data service scenario, the resource data may be system resource data allocated to the target user, and so on.
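The flow of step S16 can be sketched as follows. This is an illustrative sketch, assuming scikit-learn; the model choice, the helper names, and the "default_quota" allocation rule are hypothetical examples, not the embodiments' implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def build_target_classifier(X_full, y, target_idx):
    """Train the target classifier on only the screened (target) input features."""
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_full[:, target_idx], y)
    return model

def allocate(model, user_features, target_idx, resource="default_quota"):
    """Allocate system resource data only if the user is of the specified type."""
    row = np.asarray(user_features)[target_idx][None, :]  # keep only screened features
    is_specified = model.predict(row)[0] == 1
    return resource if is_specified else None
```

A user predicted to be of the specified type receives the resource; any other user receives nothing, matching the allocation rule described above.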
In the above embodiment, the classifier is constructed using the feature data set to obtain the feature importance of each of the plurality of features, and feature screening is performed based on the feature importance obtained in the process of constructing the classifier. Compared with calculating feature importance before model training, the feature importance calculated in the process of constructing the classifier is more accurate, so the classifier obtained by training on the screened features predicts better. Meanwhile, feature screening reduces the number of feature types, which helps the model run smoothly with a large data volume. In addition, whether the target user is a user of the specified type is determined by the trained classifier, so system resource data can be distributed based on the user's type, improving the accuracy and efficiency of system resource data allocation and thereby the utilization rate of system resources.
In some embodiments of the present description, constructing a classifier using the feature data set to generate the full feature importance list may include: determining whether the number of features in the feature data set is greater than a preset value; when the number of features in the feature data set is greater than the preset value, grouping the features of the feature data set to obtain a plurality of feature data subsets, where each feature data subset contains, for each of the plurality of users, a correspondence between a group of that user's features and the label; constructing a sub-classifier using each of the plurality of feature data subsets to obtain a feature importance sub-list corresponding to each feature data subset, where the sub-list corresponding to a feature data subset contains the feature importance of each feature in that subset's group of features; and merging the feature importance sub-lists corresponding to the feature data subsets to generate the full feature importance list.
When the number, or the number of types, of features corresponding to each user in the feature data set is greater than a preset value, the feature data set needs to be grouped, and model training is carried out on the grouped data. The preset value can be set in advance by service personnel according to the performance of the resource allocation server; for example, it may be set to 100, 500, or 1000. When the number of features in the feature data set is determined to be greater than the preset value, the features of the feature data set are grouped to obtain a plurality of feature data subsets. Each feature data subset contains, for each of the plurality of users, a correspondence between a group of that user's features and the label, and each group of features is a subset of the plurality of features in the feature data set. In one embodiment, the features in the feature data set may be grouped randomly; in another embodiment, they may be grouped according to a preset rule.
The resource allocation server may construct a sub-classifier using each of the plurality of feature data subsets. That is, the resource allocation server may perform model training on each feature data subset to obtain the sub-classifier corresponding to that subset. In the process of constructing each sub-classifier, a feature importance sub-list corresponding to that sub-classifier can be generated; the sub-list corresponding to a feature data subset may contain the feature importance of each feature in that subset's group of features. The feature importance sub-lists corresponding to the feature data subsets can then be merged to generate the full feature importance list. In this way, when there are many feature types, the feature data set is grouped and model training is carried out on the grouped data, which avoids the situation in which the data volume is too large, or the features too numerous, for the model to run, so that the post-training feature importance could not be obtained, and allows the full feature importance list to be generated.
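The grouped training above can be sketched as follows. This is an illustrative sketch, assuming scikit-learn; random grouping, the group size, and the random-forest sub-classifier are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def full_importance_by_groups(X, y, feature_names, group_size=100, seed=0):
    """Train one sub-classifier per feature group and merge the importance sub-lists."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])      # random grouping of feature indices
    full_list = {}
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        sub = RandomForestClassifier(n_estimators=50, random_state=seed)
        sub.fit(X[:, idx], y)                # sub-classifier sees only this group
        for j, imp in zip(idx, sub.feature_importances_):
            full_list[feature_names[j]] = float(imp)  # importance sub-list entry
    # Merge the sub-lists into one full list, descending by importance.
    return sorted(full_list.items(), key=lambda kv: kv[1], reverse=True)
```

Each sub-classifier only ever holds `group_size` features in memory, which is the point of the grouping: the full feature table never has to fit through a single training run.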
In some embodiments of the present description, determining the target input feature set based on the full feature importance list may include: and determining a set of features with feature importance greater than a first preset threshold in the full feature importance list as a target input feature set. The first preset threshold value can be selected according to actual conditions. For example, the first preset threshold may be set to 0.01, 0.02, 0.05, or the like. By the mode, the input feature set used for model training can be screened out according to the full feature importance list.
In some embodiments of the present description, the features in the full feature importance list may be sorted in descending order by feature importance; accordingly, determining the target input feature set based on the full feature importance list may include: determining whether the number of the features of which the feature importance is greater than a first preset threshold in the full feature importance list is greater than a preset number; under the condition that the number of the features of which the feature importance is greater than a first preset threshold value in the full feature importance list is determined to be greater than a preset number, determining a set of the features of which the number is preset before the full feature importance list as a target input feature set; and under the condition that the number of the features of which the feature importance is greater than the first preset threshold in the full feature importance list is not greater than the preset number, determining a set of the features of which the feature importance is greater than the first preset threshold in the full feature importance list as a target input feature set.
Specifically, the number of features in the full feature importance list whose feature importance is greater than the first preset threshold may be determined, and it is then judged whether this number is greater than a preset number. The preset number may be set according to the performance of the resource allocation server; for example, it may be set to 500, 800, or 1000. If the number of features whose feature importance is greater than the first preset threshold is greater than the preset number, the set of the top preset number of features in the full feature importance list is determined as the target input feature set; otherwise, the set of features whose feature importance is greater than the first preset threshold is determined as the target input feature set. In this way, an upper limit on the number of model input features can be enforced, so that the model runs smoothly with a large data volume.
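The selection rule above, a threshold with a cap on the number of features, can be sketched as follows; the threshold and cap values shown are the illustrative ones mentioned in the text.

```python
def select_target_features(full_list, threshold=0.01, max_features=500):
    """Pick the target input feature set from a full feature importance list.

    full_list: (name, importance) pairs sorted in descending importance order.
    """
    above = [name for name, imp in full_list if imp > threshold]
    if len(above) > max_features:
        # Too many qualifying features: keep only the top `max_features` of the list.
        return [name for name, _ in full_list[:max_features]]
    return above
```

When the threshold alone would admit more features than the server can handle, the cap takes over; otherwise the threshold decides.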
In some embodiments of the present description, determining the target input feature set based on the full feature importance list may include: determining a first input feature set based on the full feature importance list; constructing a first classifier by using a first input feature set to obtain a first input feature importance list and a first prediction result index, wherein the first input feature importance list comprises feature importance of each feature in a plurality of features in the first input feature set; determining a set of features of which the feature importance is greater than a second preset threshold in the first input feature importance list as a second input feature set; constructing a second classifier by using the second input feature set to obtain a second prediction result index; and determining a target input feature set from the first input feature set and the second input feature set according to the first prediction result index and the second prediction result index.
Specifically, a first input feature set may be determined based on the full feature importance list, in which the features may be sorted in descending order of feature importance. For example, the set of the top preset number of features in the full feature importance list may be determined as the first input feature set. As another example, the set of features whose feature importance is greater than a first preset threshold may be determined as the first input feature set. As yet another example, when the number of features whose feature importance is greater than the first preset threshold is greater than the preset number, the set of the top preset number of features in the full feature importance list may be determined as the first input feature set; otherwise, the set of features whose feature importance is greater than the first preset threshold may be determined as the first input feature set.
After the first input feature set is obtained, a first classifier may be constructed using the first input feature set to obtain a first input feature importance list and a first prediction result index. The first input feature importance list includes the feature importance of each of the plurality of features in the first input feature set. The first prediction result index may be a classifier evaluation index obtained by predicting the test set using the first classifier; for example, it may include indexes such as recall, accuracy, and precision.
The server may determine the set of features in the first input feature set whose feature importance is greater than a second preset threshold as the second input feature set. The second preset threshold may be preset by service personnel; for example, it may be set to 0, 0.001, or 0.002. After the second input feature set is obtained, a second classifier may be constructed using the second input feature set to obtain a second prediction result index. The second prediction result index may be a classifier evaluation index obtained by predicting the test set using the second classifier; for example, it may include recall, accuracy, and the like.
Then, the server may determine a target input feature set from the first input feature set and the second input feature set according to the first prediction result index and the second prediction result index. For example, the input feature set corresponding to the classifier with the higher precision may be determined as the target input feature set. As another example, the input feature set corresponding to the classifier with the higher recall may be determined as the target input feature set. As yet another example, the target input feature set may be determined from the first and second input feature sets by weighing precision and recall together. In this way, the second input feature set is generated from the feature importance list produced by training on the first input feature set, and may therefore contain fewer features than the first input feature set; the dropped features are those whose importance proved insufficient in that training. Removing features of insufficient importance reduces the resource cost of model training and prediction.
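As a minimal sketch of this comparison step, assuming scikit-learn as the modeling library (the specification does not name one) and using test accuracy as the single prediction result index; the candidate feature sets and their sizes below are illustrative:

```python
# Build a classifier on each candidate input feature set, score it on a
# held-out test set, and keep the set whose classifier scores higher.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def score(feature_idx):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr[:, feature_idx], y_tr)
    return clf.score(X_te[:, feature_idx], y_te)  # test-set accuracy

first_set  = list(range(10))  # e.g. features kept by the first screening
second_set = list(range(6))   # e.g. features kept after the second screening

target_set = first_set if score(first_set) >= score(second_set) else second_set
```

In practice the comparison can weigh recall and precision as described above rather than a single accuracy score.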
In some embodiments of the present description, constructing a target classifier using a set of target input features comprises: determining a training set corresponding to each preset white sample sampling proportion in the plurality of preset white sample sampling proportions according to the plurality of preset white sample sampling proportions and the target input feature set; constructing a classifier by using the training set corresponding to each preset white sample sampling proportion to obtain the classifier and a prediction result index corresponding to each preset white sample sampling proportion; and determining a target classifier according to the prediction result index corresponding to each preset white sample sampling proportion.
Specifically, after the target input feature set is determined, the white samples of the training set may be sampled at different ratios, a model trained for each ratio, and the prediction effects of the models on the test set under the different sampling ratios compared to determine the final sampling ratio. To this end, the training set corresponding to each of the plurality of preset white sample sampling ratios may be determined from those ratios and the target input feature set: a target feature data set is first extracted from the feature data set according to the target input feature set, where the target feature data set includes the correspondence between the input features in the target input feature set and the tag for each of the plurality of users; a training set is then generated for each preset white sample sampling ratio from that ratio and the target feature data set. A white sample refers to the feature data of a user who is not of the specified type. A classifier is then constructed with the training set for each preset white sample sampling ratio, yielding a classifier and a prediction result index for each ratio; the prediction result index may include evaluation indexes such as recall, accuracy, and precision. The server may determine the target classifier according to the prediction result index corresponding to each preset white sample sampling ratio; for example, the classifier with the highest precision may be determined as the target classifier.
By the method, the model prediction effect can be improved by adjusting the sampling proportion of the white samples, so that the accuracy and the efficiency of resource allocation are further improved, and the utilization rate of system resources is improved.
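The white-sample downsampling described above can be sketched as follows; the ratios (30%, 50%, 80%) and fixed seed 50 follow the worked example later in this specification, while the toy label array is hypothetical:

```python
# For each preset ratio, keep that fraction of the white samples (label 0)
# plus all positive samples, producing one training set per ratio.
import numpy as np

rng = np.random.default_rng(50)      # fixed sampling seed, as in the text
y = np.array([0] * 90 + [1] * 10)    # 90 white (negative) samples, 10 positive
X = np.arange(len(y)).reshape(-1, 1)

def sample_white(X, y, ratio):
    white = np.flatnonzero(y == 0)
    keep = rng.choice(white, size=int(len(white) * ratio), replace=False)
    idx = np.sort(np.concatenate([keep, np.flatnonzero(y == 1)]))
    return X[idx], y[idx]

training_sets = {r: sample_white(X, y, r) for r in (0.3, 0.5, 0.8)}
```

Each ratio keeps every positive sample and only the stated fraction of white samples.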
In some embodiments of the present description, determining a target classifier according to a prediction result indicator corresponding to each preset white sample sampling ratio may include: determining a target white sample sampling ratio from a plurality of white sample sampling ratios according to a prediction result index corresponding to each preset white sample sampling ratio; constructing a classifier by using the sampling proportion of the target white sample, the target input feature set and the multiple groups of preset model parameters to obtain the classifier and a prediction result index corresponding to each group of preset model parameters in the multiple groups of preset model parameters; and determining a target classifier from the classifiers corresponding to each group of preset model parameters in the plurality of groups of preset model parameters according to the prediction result indexes corresponding to each group of preset model parameters.
Specifically, after the prediction result indexes corresponding to the plurality of preset white sample sampling ratios are obtained, the target white sample sampling ratio may be determined from the plurality of white sample sampling ratios according to those indexes. For example, the white sample sampling ratio corresponding to the classifier with the best prediction effect may be determined as the target white sample sampling ratio, where the prediction effect is judged from the prediction result indexes according to the actual demand. After the target white sample sampling ratio is determined, classifiers may be constructed using the target white sample sampling ratio, the target input feature set, and a plurality of groups of preset model parameters, obtaining a classifier and a prediction result index for each group of preset model parameters. A group of preset model parameters may include at least one of: the learning rate, the number of trees, and the tree depth. The target classifier may then be determined from the classifiers corresponding to the groups of preset model parameters according to their prediction result indexes. In this way, the classifier can be optimized by adjusting the model parameters, further improving the accuracy and efficiency of resource allocation and the utilization of system resources.
The above method is described below with reference to a specific example, however, it should be noted that the specific example is only for better describing the present specification and should not be construed as an undue limitation on the present specification.
Referring to fig. 2, which shows a flowchart of the system resource allocation method in this embodiment, the procedure can generally be divided into two rounds. First round: based on the feature wide table of labeled structured data, models are trained on all features in groups to generate feature importance; the screened feature set F1 is used as the input features to train a model and generate feature importance again; and the further screened feature set F2 is used as the input features to train another model. The prediction effects on the test set of the models with input feature sets F1 and F2 are compared, and the input feature sets with the higher baseline accuracy and with the higher overall accuracy are obtained respectively. Then, based on the selected input feature sets, the white samples of the training set are sampled at different ratios, models are trained, and the prediction effects on the test set under the different sampling ratios are compared, obtaining the input feature set and training-set white sample sampling ratio with the higher baseline accuracy and with the higher overall accuracy. Finally, based on the selected input feature sets and training-set white sample sampling ratios, models are trained with different learning rates, tree numbers, and tree depths, and the prediction effects on the test set under the different model parameters (i.e., learning rate, number of trees, tree depth) are compared, obtaining the input feature set, training-set white sample sampling ratio, learning rate, number of trees, tree depth, and training duration with the higher baseline accuracy and with the higher overall accuracy.
In the second round, the steps of the first round are repeated; because the feature importance generated by each training and the accuracy of the prediction effect vary with the grouping or for other reasons, the results of the rounds are not completely consistent. Finally, the prediction effects on the final test sets of the rounds are compared, and the input feature set, training-set white sample sampling ratio, learning rate, number of trees, tree depth, and training duration with the higher baseline accuracy and with the higher overall accuracy are obtained as the model tuning results, for reference by model training personnel.
As shown in fig. 2, the method for allocating system resource data in this embodiment may specifically include the following steps:
The default parameters PARAM of the tree model are: learning rate 0.6, number of trees 50, tree depth 6.
S101: a full feature importance list T0 is obtained.
Specifically, based on the feature wide table of labeled structured data, models are trained in groups of 1000 features each, the feature importance of each group is generated, and the per-group importances are merged and sorted in descending order of feature importance, generating the full feature importance list T0. The features are index data generated according to business logic and related to target customers (the objects of the binary classification prediction: positive samples are target customers, negative samples are non-target customers). Feature importance is the degree of importance of a feature to model prediction, calculated by the algorithm after model training; it ranges between 0 and 1, and a larger value indicates a more important feature. The full feature importance list T0 may be as shown in Table 1 below:
TABLE 1
Feature         Feature importance
Feature 1       0.7
Feature 2       0.5
...             ...
Feature 3001    0.00004
Feature 3002    0.00001
...             ...
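The grouped training that produces T0 can be sketched as follows, assuming scikit-learn's RandomForestClassifier as a stand-in tree model (the specification does not name a library) and a group size of 4 instead of 1000 for brevity:

```python
# Split the features into fixed-size groups, train a sub-classifier per group,
# then merge the per-group importances and sort them descending into T0.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

GROUP_SIZE = 4  # the patent uses 1000 features per group

X, y = make_classification(n_samples=200, n_features=12, random_state=0)

t0 = {}  # feature name -> importance
for start in range(0, X.shape[1], GROUP_SIZE):
    cols = list(range(start, min(start + GROUP_SIZE, X.shape[1])))
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[:, cols], y)
    for col, imp in zip(cols, clf.feature_importances_):
        t0[f"feature_{col}"] = imp

# Full feature importance list T0, sorted descending by importance.
full_list = sorted(t0.items(), key=lambda kv: kv[1], reverse=True)
```

Grouping keeps each training run within the feature count the model can handle; the merged list is only an approximation of what a single model over all features would report.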
S102: judge whether the number of features with feature importance greater than zero in the full feature importance list T0 is greater than 1000; if so, execute step S103; otherwise, execute step S104.
S103: take the top 1000 features with feature importance >= 0.01 as the feature set F1.
S104: take the features with feature importance >= 0.01 as the feature set F1.
Specifically, with millions of data records, the algorithm may fail to run when more than 1000 features are used, so the number of features after screening is controlled to within 1000. The feature importance list generated after model training may contain fewer features than the input features, which means the dropped features had insufficient importance in that training. Taking the features with feature importance >= 0.01 focuses the model training on the important features.
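The screening rule of steps S102-S104 can be expressed compactly; the function name and the toy list are hypothetical, while the thresholds 1000 and 0.01 follow the text:

```python
# If more than max_features features have nonzero importance, keep the top
# max_features with importance >= min_importance (S103); otherwise keep every
# feature with importance >= min_importance (S104).
def screen_f1(t0, max_features=1000, min_importance=0.01):
    """t0: list of (feature, importance) sorted descending by importance."""
    kept = [(f, imp) for f, imp in t0 if imp >= min_importance]
    nonzero = sum(1 for _, imp in t0 if imp > 0)
    if nonzero > max_features:
        return kept[:max_features]  # S103: cap at the top max_features
    return kept                     # S104: no cap needed

t0 = [("f1", 0.7), ("f2", 0.5), ("f3", 0.009), ("f4", 0.0)]
f1 = screen_f1(t0, max_features=2)
```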
S105: train a model with the feature set F1 as the input features, predict the test set, and obtain the test set recall-accuracy list PR1, the feature importance list T1, and the training duration TIME1. The recall-accuracy list PR1 records the highest accuracy at each recall, as shown in Table 2 below. The feature importance list T1 is shown in Table 3 below. The training duration TIME1 is the duration of the model training.
TABLE 2
Recall           Accuracy (highest)
Recall value 1   Accuracy value 1
Recall value 2   Accuracy value 2
...              ...
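One possible way to build such a recall-to-highest-accuracy list from a scored test set uses scikit-learn's precision_recall_curve (the specification does not prescribe the construction); the labels and scores below are illustrative:

```python
# For each recall level reached on the test set, record the highest precision
# attained at that recall, yielding a (recall, highest precision) list.
from collections import defaultdict
from sklearn.metrics import precision_recall_curve

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]

precision, recall, _ = precision_recall_curve(y_true, y_score)

best = defaultdict(float)
for p, r in zip(precision, recall):
    key = round(float(r), 4)
    best[key] = max(best[key], float(p))

# PR list sorted by recall ascending: [(recall, highest precision), ...]
pr_list = sorted(best.items())
```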
TABLE 3
Feature(s) Importance of features
Feature 1 0.6
Feature 2 0.55
... ...
S106: take the features with feature importance greater than zero in the feature importance list T1 as the feature set F2.
Specifically, based on the feature importance list generated after model training, the features with feature importance greater than zero are taken as input, the model is trained again, and its prediction effect on the test set is observed. The feature importance list generated after model training may contain fewer features than the input features, which means the dropped features had insufficient importance in that training.
S107: train a model with the feature set F2 as the input features, predict the test set, and obtain the test set recall-accuracy list PR2, the feature importance list T2, and the training duration TIME2.
S108: according to the baseline recall R0 and the baseline accuracy PRE0, obtain: the PR list PR_A1 with the higher baseline accuracy, the input feature set F_A1 with the higher baseline accuracy, the PR list PR_A2 with the higher overall accuracy, and the input feature set F_A2 with the higher overall accuracy.
Specifically, the recall R0 and accuracy PRE0 before model tuning are obtained as the reference baseline, and the recall R01 closest to the baseline recall R0 is found. In PR1 and PR2, the accuracy corresponding to R01 is obtained; the PR list with the higher accuracy is taken as PR_A1 and its input feature set as F_A1. If the accuracies corresponding to R01 are the same, the feature set with fewer input features and its PR list are taken. If both the accuracy corresponding to R01 and the number of input features are the same, the input feature set with the shorter training duration and its PR list are taken. If the accuracy, the number of input features, and the training duration corresponding to R01 are all the same, either input feature set and its PR list may be taken.
Then the numbers of entries with the higher accuracy at the same recall are compared between PR1 and PR2; the PR list with more such entries is taken as PR_A2 and its input feature set as F_A2. If these counts are the same, the feature set with fewer input features and its PR list are taken. If both the count and the number of input features are the same, the input feature set with the shorter training duration and its PR list are taken. If the count, the number of input features, and the training duration are all the same, either input feature set and its PR list may be taken.
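The tie-break order used in S108 (higher accuracy, then fewer input features, then shorter training duration, then an arbitrary choice) can be rendered as a single key function; the candidate tuples below are hypothetical:

```python
# Pick the winning candidate: maximize accuracy, then minimize the number of
# input features, then minimize training duration; max() returns the first
# maximal candidate, which realizes the "take either one" final tie-break.
def pick(candidates):
    """candidates: list of (accuracy_at_baseline, n_features, train_time, name)."""
    return max(candidates, key=lambda c: (c[0], -c[1], -c[2]))[3]

winner = pick([(0.82, 900, 120.0, "F1"), (0.82, 600, 150.0, "F2")])
```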
S201: sample the training set white samples (with the sampling seed fixed at 50) at ratios of 30%, 50%, and 80%, respectively; with input feature sets F_A1 and F_A2 respectively, train models, predict the test set, and obtain the test set recall-accuracy list, feature importance list, and training duration for each combination, as shown in Table 4 below. The sampling seed is fixed at 50 to fix the sampled samples and reduce the influence of sample variation. The training set white samples are the samples of non-target customers in the training set.
TABLE 4
(Table 4 is provided as an image in the original publication; for each combination of input feature set (F_A1, F_A2) and training-set white sample sampling ratio (30%, 50%, 80%), it lists the test set recall-accuracy list, the feature importance list, and the training duration.)
S202: according to the baseline recall R0 and the baseline accuracy PRE0, obtain: the PR list PR_A3 with the higher baseline accuracy, the input feature set F_A3 with the higher baseline accuracy, the training-set white sample sampling ratio W3 with the higher baseline accuracy, the PR list PR_A4 with the higher overall accuracy, the input feature set F_A4 with the higher overall accuracy, and the training-set white sample sampling ratio W4 with the higher overall accuracy.
The recall R0 and accuracy PRE0 before model tuning are obtained as the comparison baseline, and the recall R01 closest to the baseline recall R0 is found. PR3_F_A1_W30, PR3_F_A1_W50, PR3_F_A1_W80, PR4_F_A2_W30, PR4_F_A2_W50, PR4_F_A2_W80, PR_A1 (unsampled), and PR_A2 (unsampled) are compared, with the unsampled ratio recorded as 100%. The accuracy corresponding to R01 is obtained; the PR list with the higher accuracy is taken as PR_A3, its input feature set as F_A3, and its training-set white sample sampling ratio as W3. If the accuracies corresponding to R01 are the same, the feature set with fewer input features, its PR list, and its training-set white sample sampling ratio are taken. If both the accuracy and the number of input features are the same, the input feature set with the shorter training duration, its PR list, and its sampling ratio are taken. If the accuracy, the number of input features, and the training duration are all the same, any one input feature set with its PR list and sampling ratio may be taken.
The numbers of entries with the higher accuracy at the same recall are compared across the PR lists; the PR list with more such entries is taken as PR_A4, its input feature set as F_A4, and its training-set white sample sampling ratio as W4. If these counts are the same, the feature set with fewer input features, its PR list, and its sampling ratio are taken. If both the count and the number of input features are the same, the input feature set with the shorter training duration, its PR list, and its sampling ratio are taken. If the count, the number of input features, and the training duration are all the same, any one of them may be taken.
S301: adjust the tree model parameters. With input feature set F_A3 and training-set white sample sampling ratio W3, train models and predict the test set for learning rates 0.2 and 0.6, tree numbers 50, 100, 200, and 400, and tree depths 6 and 10, obtaining the test set recall-accuracy list, feature importance list, and training duration for each combination. Do the same with input feature set F_A4 and training-set white sample sampling ratio W4. The tree model parameters of PR_A3 and PR_A4 are the default parameters PARAM (learning rate 0.6, 50 trees, tree depth 6), so the experiment for that parameter combination was already completed in the previous step.
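The parameter sweep of S301 can be sketched as a small grid search, assuming scikit-learn's GradientBoostingClassifier as a stand-in tree model and a reduced grid (fewer tree counts) so the example runs quickly; test accuracy stands in for the full recall-accuracy comparison:

```python
# Train and score one model per (learning rate, number of trees, tree depth)
# combination; keep the best-scoring combination.
from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

grid = product((0.2, 0.6), (50, 100), (6, 10))  # the text also uses 200 and 400 trees
results = {}
for lr, n_trees, depth in grid:
    clf = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=n_trees, max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    results[(lr, n_trees, depth)] = clf.score(X_te, y_te)

best_params = max(results, key=results.get)
```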
S302: according to the baseline recall R0 and the baseline accuracy PRE0, obtain: the PR list PR_A5, input feature set F_A5, training-set white sample sampling ratio W5, model parameters PARAM5 (learning rate, number of trees, tree depth), and training duration TIME5 with the higher baseline accuracy; and the PR list PR_A6, input feature set F_A6, training-set white sample sampling ratio W6, model parameters PARAM6 (learning rate, number of trees, tree depth), and training duration TIME6 with the higher overall accuracy.
PR_A3 (default parameters PARAM) and PR_A4 (default parameters PARAM) are compared with the test set recall-accuracy lists from step S301. The recall R0 and accuracy PRE0 before model tuning are obtained as the comparison baseline, and the recall R01 closest to the baseline recall R0 is found. The accuracy corresponding to R01 is obtained; the PR list with the higher accuracy is taken as PR_A5, with input feature set F_A5, training-set white sample sampling ratio W5, model parameters PARAM5, and training duration TIME5. If the accuracies corresponding to R01 are the same, the feature set with fewer input features, together with its PR list, training-set white sample sampling ratio, model parameters, and training duration, is taken. If both the accuracy and the number of input features are the same, the input feature set with the shorter training duration, together with its PR list, sampling ratio, model parameters, and training duration, is taken. If the accuracy, the number of input features, and the training duration are all the same, any one of them may be taken.
The numbers of entries with the higher accuracy at the same recall are compared; the PR list with more such entries is taken as PR_A6, with input feature set F_A6, training-set white sample sampling ratio W6, model parameters PARAM6, and training duration TIME6. If these counts are the same, the feature set with fewer input features, together with its PR list, sampling ratio, model parameters, and training duration, is taken. If both the count and the number of input features are the same, the one with the shorter training duration is taken. If the count, the number of input features, and the training duration are all the same, any one of them may be taken.
S401: judge whether the current round is the second round. If not, repeat the steps of the first round; because the feature importance generated by each training and the accuracy of the prediction effect vary, the results of the rounds are not completely consistent. If yes, execute step S402.
S402: according to the baseline recall R0 and the baseline accuracy PRE0, obtain: the PR list PR_B1, input feature set F_B1, training-set white sample sampling ratio W_B1, model parameters PARAM_B1, and training duration TIME_B1 with the higher baseline accuracy; and the PR list PR_B2, input feature set F_B2, training-set white sample sampling ratio W_B2, model parameters PARAM_B2, and training duration TIME_B2 with the higher overall accuracy.
The PR_A5, PR_A6, F_A5, F_A6, TIME5, and TIME6 of each round are compared. The recall R0 and accuracy PRE0 before model tuning are obtained as the comparison baseline, and the recall R01 closest to the baseline recall R0 is found. The accuracy corresponding to R01 is obtained; the PR list with the higher accuracy is taken as PR_B1, with input feature set F_B1, training-set white sample sampling ratio W_B1, model parameters PARAM_B1, and training duration TIME_B1. If the accuracies corresponding to R01 are the same, the feature set with fewer input features, together with its PR list, training-set white sample sampling ratio, model parameters, and training duration, is taken. If both the accuracy and the number of input features are the same, the input feature set with the shorter training duration, together with its PR list, sampling ratio, model parameters, and training duration, is taken. If the accuracy, the number of input features, and the training duration are all the same, any one of them may be taken.
The numbers of entries with the higher accuracy at the same recall are compared; the PR list with more such entries is taken as PR_B2, with input feature set F_B2, training-set white sample sampling ratio W_B2, model parameters PARAM_B2, and training duration TIME_B2. If these counts are the same, the feature set with fewer input features, together with its PR list, sampling ratio, model parameters, and training duration, is taken. If both the count and the number of input features are the same, the one with the shorter training duration is taken. If the count, the number of input features, and the training duration are all the same, any one of them may be taken.
S403: final output: the input feature set F_B1, recall-accuracy list PR_B1, training-set white sample sampling ratio W_B1, model parameters PARAM_B1, and training duration TIME_B1 with the higher baseline accuracy; and the input feature set F_B2, recall-accuracy list PR_B2, training-set white sample sampling ratio W_B2, model parameters PARAM_B2, and training duration TIME_B2 with the higher overall accuracy.
The model inputs corresponding to higher baseline accuracy and higher overall accuracy may be different, so the final output distinguishes the two cases.
In this embodiment, feature-grouped training on big-data, many-feature structured data (millions of records, more than three thousand features) solves the problem that the feature importance generated after model training cannot be obtained because the model cannot run when the data volume is large and the features are many. Obtaining the feature importance generated after model training reduces the gap between pre-computed feature importance and the importance the model actually assigns. Taking the top N important features caps the number of model features, so the model can run smoothly on large data volumes. Training the model on the important features generated after model training narrows the range of input features. Sampling the white samples of the training set and tuning the model parameters further improve the prediction effect of the model on the test set. In the three stages of feature screening, sampling, and parameter tuning, the inputs with the higher baseline accuracy and with the higher overall accuracy are found respectively, covering the practical case where different model inputs yield the higher baseline accuracy and the higher overall accuracy. Finally, binary-classification tree-model tuning of large-volume, many-feature structured data is achieved. Determining whether a user is a target customer (i.e., a user of the specified type) based on the optimized binary-classification tree model improves accuracy, and allocating system resource data based on the user type improves the accuracy and efficiency of system resource data allocation and the utilization of system resources.
Based on the same inventive concept, the embodiments of the present specification further provide a system resource data allocation apparatus, as described in the following embodiments. Because the principle by which the system resource data allocation apparatus solves the problem is similar to that of the system resource data allocation method, the implementation of the apparatus can refer to the implementation of the method, and repeated details are not described again. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated. Fig. 3 is a block diagram of a system resource data allocation apparatus according to an embodiment of the present specification; as shown in fig. 3, the apparatus includes an acquisition module 301, a generation module 302, a determination module 303, and an allocation module 304, which are described below.
The obtaining module 301 is configured to obtain a feature data set, where the feature data set includes a correspondence between a plurality of features of each of a plurality of users and a tag, and the tag is used to characterize whether the user is a user of a specified type.
The generating module 302 is configured to construct a classifier using the feature data set, and generate a full feature importance list, where the full feature importance list includes feature importance of each feature in the plurality of features in the feature data set.
The determining module 303 is configured to determine a target input feature set based on the full feature importance list.
The allocation module 304 is configured to construct a target classifier using the target input feature set, determine whether a target user is a user of the specified type based on the target classifier, and allocate system resource data to the target user according to the target user's type.
In some embodiments of the present specification, the generation module may be specifically configured to: determine whether the number of features in the feature data set is greater than a preset value; when it is, group the features of the feature data set into a plurality of feature data subsets, each subset containing the correspondence between one group of features of each of the plurality of users and the tag; construct a sub-classifier from each feature data subset to obtain a feature importance sub-list for that subset, containing the feature importance of each feature in the subset's group of features; and merge the feature importance sub-lists of the subsets into the full feature importance list.
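As a rough illustration of the grouped training pass described above, the sketch below trains one sub-classifier per feature group and merges the per-group importance sub-lists into a full, descending-sorted list. The use of scikit-learn's `RandomForestClassifier` as the sub-classifier, the group size, and the toy data are illustrative assumptions; the specification does not prescribe a particular model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def full_feature_importance(X, y, max_features_per_run=1000, seed=0):
    """Train a sub-classifier per feature group and merge the per-group
    importance sub-lists into one full list, sorted descending."""
    n_features = X.shape[1]
    n_groups = max(1, -(-n_features // max_features_per_run))  # ceiling division
    groups = np.array_split(np.arange(n_features), n_groups)
    importance = {}
    for group in groups:
        sub = RandomForestClassifier(n_estimators=100, random_state=seed)
        sub.fit(X[:, group], y)                    # train on this group's columns only
        for col, imp in zip(group, sub.feature_importances_):
            importance[int(col)] = float(imp)
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: only columns 0 and 17 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = (X[:, 0] + X[:, 17] > 0).astype(int)
ranking = full_feature_importance(X, y, max_features_per_run=10)
```

With a group size of 10, columns 0 and 17 land in different sub-classifiers, yet both surface near the top of the merged list — which is the point of the grouped pass: no single model ever has to hold all features at once.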
In some embodiments of the present specification, the determining module may be specifically configured to: determine the set of features in the full feature importance list whose feature importance is greater than a first preset threshold as the target input feature set.
In some embodiments of the present specification, the features in the full feature importance list are sorted in descending order of feature importance; accordingly, the determining module may be specifically configured to: determine whether the number of features in the full feature importance list whose feature importance is greater than a first preset threshold exceeds a preset number; if it does, determine the first preset number of features in the full feature importance list as the target input feature set; otherwise, determine the set of features whose feature importance is greater than the first preset threshold as the target input feature set.
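The threshold-with-top-N-cap rule above can be sketched in a few lines of plain Python. The threshold value and preset number here are illustrative, not values fixed by the specification.

```python
def select_target_features(full_list, threshold, max_count):
    """full_list: [(feature, importance)] sorted descending by importance.
    Keep features above `threshold`, but never more than `max_count`."""
    above = [f for f, imp in full_list if imp > threshold]
    if len(above) > max_count:
        # More qualifying features than the preset number: fall back to the
        # first max_count entries of the (already sorted) full list.
        return [f for f, _ in full_list[:max_count]]
    return above

ranked = [("f1", 0.40), ("f2", 0.25), ("f3", 0.15), ("f4", 0.12), ("f5", 0.08)]
capped = select_target_features(ranked, threshold=0.10, max_count=3)   # hits the cap
uncapped = select_target_features(ranked, threshold=0.20, max_count=3)  # under the cap
```

The cap is what bounds the model's feature count and keeps it runnable at large data volumes; the threshold alone would not, since on a three-thousand-feature set arbitrarily many features may clear it.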
In some embodiments of the present specification, the determining module may be specifically configured to: determine a first input feature set based on the full feature importance list; construct a first classifier using the first input feature set to obtain a first input feature importance list and a first prediction result index, the first input feature importance list containing the feature importance of each feature in the first input feature set; determine the set of features in the first input feature importance list whose feature importance is greater than a second preset threshold as a second input feature set; construct a second classifier using the second input feature set to obtain a second prediction result index; and select the target input feature set from the first and second input feature sets according to the first and second prediction result indexes.
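A hedged sketch of this two-round screening: train a first classifier, re-screen using a second threshold on its importances, train a second classifier, and keep whichever input set scores better on a held-out split. The specification leaves the "prediction result index" open; AUC, scikit-learn's `GradientBoostingClassifier`, and the 0.05 threshold are illustrative choices here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def screen_round(X_tr, y_tr, X_te, y_te, cols):
    """Fit a classifier on the given columns; return it and its held-out AUC."""
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X_tr[:, cols], y_tr)
    score = roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])
    return clf, score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 12))
y = (X[:, 0] - X[:, 1] > 0).astype(int)          # only columns 0 and 1 matter
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

first_set = list(range(12))                       # first input feature set
clf1, score1 = screen_round(X_tr, y_tr, X_te, y_te, first_set)
# Second preset threshold applied to the first round's importances (illustrative).
second_set = [c for c, imp in zip(first_set, clf1.feature_importances_) if imp > 0.05]
_, score2 = screen_round(X_tr, y_tr, X_te, y_te, second_set)
target_set = second_set if score2 >= score1 else first_set
```

The comparison step matters: dropping low-importance features usually helps, but when it hurts the index, the first (wider) input set is retained.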
In some embodiments of the present specification, the allocation module may be specifically configured to: determine, from a plurality of preset white sample sampling ratios and the target input feature set, a training set corresponding to each preset white sample sampling ratio; construct a classifier with the training set corresponding to each ratio to obtain a classifier and a prediction result index for that ratio; and determine the target classifier according to the prediction result index corresponding to each preset white sample sampling ratio.
In some embodiments of the present specification, determining the target classifier according to the prediction result index corresponding to each preset white sample sampling ratio includes: determining a target white sample sampling ratio from the plurality of preset ratios according to those indexes; constructing classifiers using the target ratio, the target input feature set, and a plurality of groups of preset model parameters, to obtain a classifier and a prediction result index for each group of preset model parameters; and determining the target classifier, from the classifiers corresponding to the groups of preset model parameters, according to their prediction result indexes.
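The last two tuning stages can be sketched as follows, reading "white samples" as the majority negative (benign) class, which is the usual sense in risk modeling. Undersample the white class at several preset ratios, pick the best ratio by a validation index, then grid over preset model parameters at that ratio. The ratios, parameter grid, F1 index, and `RandomForestClassifier` are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = (X[:, 0] > 1.0).astype(int)                  # rare positive class
X_val, y_val = X[:200], y[:200]
X_tr, y_tr = X[200:], y[200:]

def sample_white(X, y, ratio, rng):
    """Keep all positives and a `ratio` fraction of the white (y==0) samples."""
    white = np.flatnonzero(y == 0)
    keep = np.concatenate([np.flatnonzero(y == 1),
                           rng.choice(white, int(len(white) * ratio), replace=False)])
    return X[keep], y[keep]

def fit_score(X, y, **params):
    clf = RandomForestClassifier(random_state=0, **params).fit(X, y)
    return clf, f1_score(y_val, clf.predict(X_val))

# Stage 1: choose the target white-sample sampling ratio by validation index.
ratios = (0.25, 0.5, 1.0)
ratio_scores = {}
for r in ratios:
    Xs, ys = sample_white(X_tr, y_tr, r, rng)
    _, ratio_scores[r] = fit_score(Xs, ys)
target_ratio = max(ratio_scores, key=ratio_scores.get)

# Stage 2: grid over preset model parameter groups at the chosen ratio.
Xs, ys = sample_white(X_tr, y_tr, target_ratio, rng)
grid = [{"n_estimators": n, "max_depth": d} for n in (50, 100) for d in (3, None)]
results = [(params, *fit_score(Xs, ys, **params)) for params in grid]
target_params, target_clf, target_score = max(results, key=lambda t: t[2])
```

Fixing the sampling ratio before sweeping parameters keeps the search one-dimensional at each stage, which matches the staged (rather than joint) tuning the embodiment describes.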
From the above description, it can be seen that the embodiments of the present specification achieve the following technical effects. A classifier is built from the feature data set, yielding the feature importance of each feature in the set, and feature screening is performed on these importances; compared with computing feature importance before model training, importance obtained while building the classifier is more accurate, so a classifier trained on the screened features predicts better. Feature screening also reduces the number of features, helping the model run smoothly on a large data volume. Finally, determining whether a target user is a user of the specified type with the trained classifier allows system resource data to be allocated according to the user's type, improving the accuracy and efficiency of system resource data allocation and the utilization of system resources.
The embodiment of the present specification further provides a computer device; Fig. 4 is a schematic structural diagram of a computer device based on the system resource data allocation method provided in this embodiment. The computer device may specifically include an input device 41, a processor 42, and a memory 43, where the memory 43 stores processor-executable instructions and the processor 42, when executing the instructions, performs the steps of the system resource data allocation method described in any of the above embodiments.
In this embodiment, the input device is one of the main means of information exchange between a user and the computer system; it may include a keyboard, mouse, camera, scanner, light pen, handwriting tablet, or voice input device, and is used to feed raw data and data-processing programs into the computer. It can also receive data transmitted from other modules, units, or devices. The processor may be implemented in any suitable way; for example, it may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on. The memory may be any storage device used in modern information technology to hold information, at any of several levels: in a digital system, anything that can store binary data qualifies as memory; in an integrated circuit, a circuit with a storage function but no physical form of its own, such as a RAM or FIFO, is also called memory; and in a system, a storage device in physical form, such as a memory module or TF card, is likewise called memory.
In this embodiment, the functions and effects of the specific implementation of the computer device can be explained in comparison with other embodiments, and are not described herein again.
Based on the system resource data allocation method, the present specification also provides a computer storage medium storing computer program instructions that, when executed, implement the steps of the system resource data allocation method in any of the above embodiments.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the embodiments described above may be implemented on a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated-circuit modules, or several of them may be fabricated as a single integrated-circuit module. Thus, the embodiments of the present specification are not limited to any specific combination of hardware and software.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the description should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above description covers only preferred embodiments of the present specification and is not intended to limit it; various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present specification shall fall within its protection scope.

Claims (10)

1. A method for allocating system resource data, comprising:
acquiring a feature data set, wherein the feature data set comprises corresponding relations between a plurality of features of each user in a plurality of users and tags, and the tags are used for representing whether the users are users of a specified type;
constructing a classifier by using the feature data set, and generating a full feature importance list, wherein the full feature importance list comprises feature importance of each feature in a plurality of features in the feature data set;
determining a target input feature set based on the full feature importance list;
and constructing a target classifier by using the target input feature set, determining whether a target user is a user of a specified type based on the target classifier, and distributing system resource data to the target user according to the type of the target user.
2. The method of claim 1, wherein constructing a classifier using the feature dataset to generate a full feature importance list comprises:
determining whether the number of the features in the feature data set is greater than a preset value;
under the condition that the number of the features in the feature data set is larger than a preset value, grouping the plurality of features of the feature data set to obtain a plurality of feature data subsets, wherein the feature data subsets comprise corresponding relations between a group of features of each user in a plurality of users and tags;
constructing a sub-classifier by utilizing each feature data subset in the feature data subsets to obtain a feature importance sub-list corresponding to each feature data subset, wherein the feature importance sub-list corresponding to each feature data subset comprises the feature importance of each feature in a group of features in the feature data subset;
and combining the characteristic importance sub-lists corresponding to the characteristic data subsets to generate a full characteristic importance list.
3. The method of claim 1, wherein determining a target input feature set based on the full feature importance list comprises:
and determining a set of features of which the feature importance is greater than a first preset threshold in the full feature importance list as a target input feature set.
4. The method of claim 1, wherein the features in the full feature importance list are sorted in descending order by feature importance;
correspondingly, determining a target input feature set based on the full feature importance list comprises:
determining whether the number of the features of which the feature importance is greater than a first preset threshold in the full feature importance list is greater than a preset number;
under the condition that the number of the features of which the feature importance is greater than a first preset threshold value in the full feature importance list is determined to be greater than a preset number, determining a set of the features of which the number is preset before the full feature importance list as a target input feature set;
and under the condition that the number of the features of which the feature importance is greater than the first preset threshold in the full feature importance list is determined to be not greater than the preset number, determining a set of the features of which the feature importance is greater than the first preset threshold in the full feature importance list as a target input feature set.
5. The method of claim 1, wherein determining a target input feature set based on the full feature importance list comprises:
determining a first input feature set based on the full feature importance list;
constructing a first classifier by using the first input feature set to obtain a first input feature importance list and a first prediction result index, wherein the first input feature importance list comprises feature importance of each feature in a plurality of features in the first input feature set;
determining a set of features of which the feature importance is greater than a second preset threshold in the first input feature importance list as a second input feature set;
constructing a second classifier by using the second input feature set to obtain a second prediction result index;
and determining a target input feature set from the first input feature set and the second input feature set according to the first predicted result index and the second predicted result index.
6. The method of claim 1, wherein constructing a target classifier using the set of target input features comprises:
determining a training set corresponding to each preset white sample sampling proportion in a plurality of preset white sample sampling proportions according to the plurality of preset white sample sampling proportions and the target input feature set;
constructing a classifier by using the training set corresponding to each preset white sample sampling proportion to obtain the classifier and a prediction result index corresponding to each preset white sample sampling proportion;
and determining a target classifier according to the prediction result index corresponding to each preset white sample sampling proportion.
7. The method according to claim 6, wherein determining the target classifier according to the prediction result index corresponding to each preset white sample sampling ratio comprises:
determining a target white sample sampling ratio from the plurality of preset white sample sampling ratios according to the prediction result index corresponding to each preset white sample sampling ratio;
constructing a classifier by using the target white sample sampling proportion, the target input feature set and multiple groups of preset model parameters to obtain a classifier and a prediction result index corresponding to each group of preset model parameters in the multiple groups of preset model parameters;
and determining a target classifier from the classifiers corresponding to each group of preset model parameters in the multiple groups of preset model parameters according to the prediction result indexes corresponding to each group of preset model parameters.
8. A system resource data allocation apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a characteristic data set, the characteristic data set comprises corresponding relations between a plurality of characteristics of each user in a plurality of users and tags, and the tags are used for representing whether the user is a user of a specified type;
a generating module, configured to construct a classifier using the feature data set, and generate a full feature importance list, where the full feature importance list includes feature importance of each feature in a plurality of features in the feature data set;
the determining module is used for determining a target input feature set based on the full feature importance list;
and the allocation module is used for constructing a target classifier by utilizing the target input feature set, determining whether a target user is a user of a specified type based on the target classifier, and allocating system resource data to the target user according to the type of the target user.
9. A computer device comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium having computer instructions stored thereon which, when executed, implement the steps of the method of any one of claims 1 to 7.
CN202110570829.5A 2021-05-25 2021-05-25 System resource data distribution method and device Pending CN113177613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110570829.5A CN113177613A (en) 2021-05-25 2021-05-25 System resource data distribution method and device

Publications (1)

Publication Number Publication Date
CN113177613A true CN113177613A (en) 2021-07-27

Family

ID=76928177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110570829.5A Pending CN113177613A (en) 2021-05-25 2021-05-25 System resource data distribution method and device

Country Status (1)

Country Link
CN (1) CN113177613A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination