CN115329909A - User portrait generation method and device and computer equipment - Google Patents


Publication number
CN115329909A
Authority
CN
China
Prior art keywords
user
data
sample
regression
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211264457.4A
Other languages
Chinese (zh)
Inventor
顾凌云
张涛
魏玉民
叶杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai IceKredit Inc
Original Assignee
Shanghai IceKredit Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai IceKredit Inc
Priority to CN202211264457.4A
Publication of CN115329909A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a user portrait generation method and device and computer equipment, and relates to the field of finance. The user portrait generation method comprises the following steps: acquiring user data of a target user, wherein the user data comprises transaction data of the target user; performing binning on the user data based on a histogram algorithm to obtain binned data; inputting the binned data into a tree model composed of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features; and inputting the prediction features into a logistic regression model in the user portrait generation module to obtain a user label corresponding to the target user, so as to generate a user portrait at least capable of representing user credit information. Compared with traditional manual binning, this embodiment is more efficient and avoids the contingency of manual operation; it can also mine nonlinear relations among features, effectively solving the problems of feature selection and feature crossing.

Description

User portrait generation method and device and computer equipment
Technical Field
The invention relates to the field of finance, in particular to a user portrait generation method and device and computer equipment.
Background
In the financial field, traditional user portraits are mostly built as binary classification models using a logistic regression algorithm. In the related art, however, logistic regression assumes that the independent variables are linearly related to the response variable and independent of one another, so it is difficult to mine nonlinear relations among feature variables or information from cross combinations of features. In addition, when data is binned using WOE (weight of evidence) encoding, the process relies too heavily on manual binning. How to improve the model's mining of nonlinear relations and reduce the contingency of manual operation is therefore an urgent problem to be solved.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a user portrait generation method, apparatus and computer device.
According to a first aspect of the embodiments of the present disclosure, there is provided a user portrait generation method, including:
acquiring user data of a target user, wherein the user data comprises transaction data of the target user;
performing binning on the user data based on a histogram algorithm to obtain binned data;
inputting the binned data into a tree model composed of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features;
and inputting the prediction features into a logistic regression model in the user portrait generation module to obtain a user label corresponding to the target user, so as to generate a user portrait at least capable of representing user credit information.
Optionally, the inputting the binned data into a tree model composed of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features includes:
inputting the binned data into each regression tree in the tree model respectively to obtain a prediction sub-feature corresponding to each regression tree;
and performing combined encoding on the prediction sub-features to obtain a combined encoded feature as the prediction features.
Optionally, each regression tree in the plurality of regression trees includes one or more prediction nodes;
the performing combined encoding on the prediction sub-features to obtain a combined encoded feature as the prediction features includes:
performing one-hot encoding according to the predicted value of each prediction node in each regression tree to obtain the combined encoded feature.
Optionally, the method comprises:
preprocessing original user data corresponding to the target user to obtain preprocessed first user data, wherein the preprocessing comprises one or more of duplicate value processing, missing value processing and outlier processing;
and performing feature derivation and feature engineering on the first user data to obtain the user data of the target user, wherein the feature derivation and feature engineering comprise one or more of data summarization, data statistics and numeric label encoding.
Optionally, the user portrait generation module is trained as follows:
acquiring a sample data set, wherein the sample data set comprises sample user data and a pre-labeled sample user label corresponding to the sample user data;
performing binning on the sample user data based on a histogram algorithm to obtain sample binned data;
constructing the tree model according to the sample binned data and the sample user label;
inputting the sample binned data into the tree model to obtain sample prediction features;
and constructing the logistic regression model according to the sample prediction features and the sample user label to obtain a trained user portrait generation module.
Optionally, the constructing the tree model according to the sample binning data and the sample user tag includes:
according to first sample binning data and a first sample user tag corresponding to the first sample binning data, constructing a first regression tree in the tree model, and determining a first residual error of the first regression tree;
and performing multiple iterations according to the sample binning data, the sample user label and the first residual error, and constructing other regression trees in the tree model to obtain the tree model.
Optionally, the performing multiple iterations according to the sample binning data, the sample user label, and the first residual error, and constructing other regression trees in the tree model to obtain the tree model includes:
repeating the step of constructing and obtaining the regression tree of the next iteration cycle according to the second sample binning data corresponding to the next iteration cycle, the second sample user label corresponding to the second sample binning data and the residual error of the regression tree corresponding to the previous iteration cycle;
and stopping iteration under the condition of meeting a preset iteration stopping condition to obtain the tree model.
Optionally, the method comprises:
determining that the preset iteration stopping condition is met under the condition that the number of regression trees in the tree model reaches a preset number threshold; and/or,
and determining that the preset iteration stopping condition is met under the condition that the residual error of the regression tree in the current iteration period is less than or equal to a preset residual error threshold value.
According to a second aspect of the embodiments of the present disclosure, there is provided a user portrait generation apparatus, including:
an acquisition module, configured to acquire user data of a target user, wherein the user data comprises transaction data of the target user;
a binning module, configured to perform binning on the user data based on a histogram algorithm to obtain binned data;
a prediction module, configured to input the binned data into a tree model composed of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features;
and a generation module, configured to input the prediction features into a logistic regression model in the user portrait generation module to obtain a user label corresponding to the target user, so as to generate a user portrait at least capable of representing user credit information.
According to a third aspect of the embodiments of the present disclosure, there is provided a computer device comprising a processor and a non-volatile memory storing computer instructions which, when executed by the processor, cause the computer device to perform the user portrait generation method of any one of the first aspect of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects. The user data is binned by a histogram algorithm, and optimal split points are searched by traversing the discrete values of the feature histogram; this gives strong partitioning capability and strong generalization capability for continuous features, so continuous features can be handled effectively. Compared with traditional manual binning, the efficiency is higher and the contingency of manual operation is avoided. In addition, the binned data is processed by a tree model composed of multiple regression trees, so nonlinear relations among features can be effectively mined and the problems of feature selection and feature crossing are effectively solved, making the finally generated user portrait, which represents user credit information, more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments will be briefly described below. It is appreciated that the following drawings depict only certain embodiments of the invention and are therefore not to be considered limiting of its scope. For a person skilled in the art, it is possible to derive other relevant figures from these figures without inventive effort.
FIG. 1 is a flow diagram illustrating a user portrait generation method in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of training a user portrait generation module in accordance with an exemplary embodiment;
FIG. 3 is another flow diagram illustrating a method of training a user portrait generation module in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram of a user portrait generation module shown in accordance with an exemplary embodiment;
FIG. 5 is a block diagram of a user portrait generation apparatus in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a configuration of a computer device, according to an example embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Furthermore, the terms "first," "second," and the like are used solely to distinguish one from another, and are not to be construed as indicating or implying relative importance.
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Fig. 1 is a flowchart illustrating a user portrait generation method according to an exemplary embodiment. The method may be applied to a computer device, which may be a user terminal, a server, or the like; this is not particularly limited in the present disclosure. As shown in Fig. 1, the method includes:
S101, acquiring user data of a target user, wherein the user data comprises transaction data of the target user.
The transaction data in the user data may specifically include, but is not limited to, data such as user basic information, deposit flow, financial transaction flow, credit investigation information, and the like.
It should be noted that the user data is acquired only when authorization has been obtained from the user and/or the financial institution; when user authorization has not been obtained, the function by which the execution subject acquires user data may be disabled. The acquired user data may be used only for generating the user portrait. For example, the user data may be stored in a file sandbox after it is acquired and deleted from the file sandbox once generation of the user portrait is determined to be complete.
S102, performing binning on the user data based on a histogram algorithm to obtain binned data.
In step S102, the maximum number of bins for the data may be 256. Binning the user data based on a histogram algorithm discretizes continuous variables, so that when the tree model splits nodes in the subsequent step S103, the optimal split point is searched among the discrete bin values, which can greatly increase the calculation speed.
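The discretization in step S102 can be illustrated with a minimal sketch. It assumes equal-width bins and a hypothetical helper named `histogram_bin`; the source only states a maximum of 256 bins and does not say how bin edges are chosen, and the binning built into a library such as LightGBM differs in detail.

```python
from typing import List, Tuple


def histogram_bin(values: List[float], max_bins: int = 256) -> Tuple[List[int], List[float]]:
    """Map continuous values to integer bin indices using equal-width bins.

    Returns (bin_indices, upper_edges). Equal-width edges are an assumption
    made for illustration only.
    """
    lo, hi = min(values), max(values)
    n_bins = min(max_bins, len(set(values)))
    if n_bins <= 1 or hi == lo:
        # A constant feature collapses into a single bin.
        return [0] * len(values), [hi]
    width = (hi - lo) / n_bins
    # Upper edge of each bin; the last edge equals the maximum value.
    edges = [lo + width * (i + 1) for i in range(n_bins)]
    # The maximum value would index one past the end, so clamp it.
    indices = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return indices, edges


# Hypothetical monthly incomes, binned into at most 4 bins for readability.
incomes = [1200.0, 3500.0, 3600.0, 8000.0, 15000.0, 15200.0]
bins, edges = histogram_bin(incomes, max_bins=4)
print(bins, edges)
```

Once features are reduced to bin indices, a tree only needs to evaluate candidate splits at bin boundaries rather than at every distinct raw value, which is where the speed-up described above comes from.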
S103, inputting the binned data into a tree model composed of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features.
Wherein the tree model may be a LightGBM model. Each regression tree may include at least one node, each node may be used to encode a feature, and the generation process of each regression tree may be a standard regression tree generation process, so that the splitting of each node is an optimization process of feature selection, and the resulting predicted features include cross information of features of user data in multiple layers of nodes.
In some optional embodiments, the histogram algorithm in step S102 may be integrated into the tree model.
S104, inputting the prediction features into a logistic regression model in the user portrait generation module to obtain a user label corresponding to the target user, so as to generate a user portrait at least capable of representing user credit information.
The logistic regression model may be an L1-regularized logistic regression model, that is, an L1 regularization term may be included in the loss used to train the logistic regression model. The user labels may include a high-risk label, a low-risk label, a no-risk label, and the like, so that the relevant institution can decide, based on the labels, whether to handle the corresponding business for the target user.
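One common way to write an L1-regularized logistic regression objective is given below as a sketch. It assumes a binary label; the symbols are not from the source: $x_i$ denotes the prediction feature vector of sample $i$, $y_i \in \{0,1\}$ its label, $w$ and $b$ the weights and intercept, and $\lambda > 0$ a regularization strength.

```latex
\min_{w,\,b}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log \sigma(w^\top x_i + b) + (1 - y_i)\log\big(1 - \sigma(w^\top x_i + b)\big) \Big] + \lambda \lVert w \rVert_1,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

The $\lambda \lVert w \rVert_1$ term tends to drive the weights of uninformative features to exactly zero, which keeps the linear model sparse.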
It will be appreciated that a user portrait may be composed of a plurality of user labels, on the basis of which the user portrait can characterize user credit information. For example, if the user portrait includes a high-risk label, the user's credit may be determined to be low based on the user portrait.
In the embodiment of the present disclosure, the user data is binned by a histogram algorithm, and optimal split points are searched by traversing the discrete values of the feature histogram; this gives strong partitioning capability and strong generalization capability for continuous features, so continuous features can be handled effectively, and compared with traditional manual binning the efficiency is higher and the contingency of manual operation is avoided. The binned data is processed by a tree model composed of multiple regression trees, so nonlinear relations among features can be effectively mined and the problems of feature selection and feature crossing are effectively solved, making the finally generated user portrait, which represents user credit information, more accurate.
In some optional embodiments, the inputting the binned data into a tree model composed of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features includes:
inputting the binned data into each regression tree in the tree model respectively to obtain a prediction sub-feature corresponding to each regression tree; and performing combined encoding on the prediction sub-features to obtain a combined encoded feature as the prediction features.
It can be understood that the parameters and the number of nodes of each regression tree in the tree model may differ, so the prediction sub-features that different regression trees output for the same data may be completely different.
With this scheme, the prediction sub-feature that each regression tree outputs for the user data is collected, and the prediction sub-features are jointly encoded into prediction features that fuse the predicted values of the multiple regression trees. This ensures that the prediction information of every regression tree is used by the logistic regression model, so that the user portrait generation module can mine information from nonlinear features while retaining the interpretability of a linear model, which effectively improves the overall prediction performance of the user portrait generation module.
Further, each regression tree in the plurality of regression trees includes one or more prediction nodes;
the performing combined encoding on the prediction sub-features to obtain a combined encoded feature as the prediction features includes:
performing one-hot encoding according to the predicted value of each prediction node in each regression tree to obtain the combined encoded feature.
It will be appreciated that each prediction node in the regression tree may represent a decision path.
For example, suppose a regression tree A includes three prediction nodes: 1A, age >= 30 years; 2A, age < 30 years and income < 15000 yuan; 3A, age < 30 years and income >= 15000 yuan. Another regression tree B includes two prediction nodes: 1B, age > 50 years; 2B, age <= 50 years and liability < 200000 yuan. If a sample falls on prediction node 2A in regression tree A and on prediction node 2B in regression tree B, the encoding of the prediction sub-feature of regression tree A may be [0,1,0] and that of regression tree B may be [0,1]; after one-hot encoding, the combined encoded feature [0,1,0,0,1] is obtained.
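The combined encoding in this example can be sketched in a few lines. The function name `combine_leaf_one_hot` and the zero-based node indexing are assumptions made for illustration; the source does not prescribe an implementation.

```python
from typing import List


def combine_leaf_one_hot(leaf_indices: List[int], leaves_per_tree: List[int]) -> List[int]:
    """Concatenate one one-hot vector per tree, marking the node each sample fell on.

    leaf_indices[i] is the zero-based prediction node hit in tree i, and
    leaves_per_tree[i] is the number of prediction nodes in tree i.
    """
    combined: List[int] = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        one_hot = [0] * n_leaves
        one_hot[leaf] = 1
        combined.extend(one_hot)
    return combined


# Tree A has 3 prediction nodes and tree B has 2; the sample lands on 2A and 2B,
# which are zero-based indices 1 and 1.
print(combine_leaf_one_hot([1, 1], [3, 2]))  # [0, 1, 0, 0, 1]
```

The length of the combined vector equals the total number of prediction nodes across all trees, and exactly one position per tree is set to 1.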
With this approach, performing one-hot encoding according to the predicted values of the prediction nodes in each regression tree effectively fuses the predictions of the regression trees for a sample, so that the user portrait generation module can mine information from nonlinear features while retaining the interpretability of a linear model, which effectively improves the overall prediction performance of the user portrait generation module.
In some optional embodiments, the method comprises:
preprocessing original user data corresponding to the target user to obtain preprocessed first user data, wherein the preprocessing comprises one or more of duplicate value processing, missing value processing and outlier processing;
and performing feature derivation and feature engineering on the first user data to obtain the user data of the target user, wherein the feature derivation and feature engineering comprise one or more of data summarization, data statistics and numeric label encoding.
For example, the data summarization and data statistics may count and summarize the transfer-in and transfer-out amounts in a large volume of preprocessed raw transaction flow data over the most recent 6 months, so as to obtain the user data. The numeric label encoding may be label encoding of non-numerical variables.
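A minimal sketch of this kind of feature derivation is shown below. The record layout, the helper names `summarize_flows` and `label_encode`, and the sample values are all hypothetical; the source does not specify data formats.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (user_id, month as "YYYY-MM", direction "in"/"out", amount).
# This layout is an assumption made for illustration.
Record = Tuple[str, str, str, float]


def summarize_flows(records: List[Record], months: List[str]) -> Dict[str, Dict[str, float]]:
    """Total transfer-in and transfer-out amounts per user over the given months."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"in_total": 0.0, "out_total": 0.0}
    )
    wanted = set(months)
    for user_id, month, direction, amount in records:
        if month not in wanted:
            continue  # outside the statistics window
        key = "in_total" if direction == "in" else "out_total"
        totals[user_id][key] += amount
    return dict(totals)


def label_encode(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct non-numerical value to an integer code (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


records = [
    ("u1", "2022-05", "in", 5000.0),
    ("u1", "2022-06", "out", 1200.0),
    ("u1", "2021-01", "in", 9999.0),  # outside the window, ignored
    ("u2", "2022-06", "in", 800.0),
]
window = ["2022-04", "2022-05", "2022-06", "2022-07", "2022-08", "2022-09"]
print(summarize_flows(records, window))
print(label_encode(["salaried", "self-employed", "salaried"]))
```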
With this scheme, invalid data (such as duplicate, missing, or abnormal values) can be effectively removed from the original data, so that the obtained user portrait is more accurate and losses caused by an inaccurately constructed user portrait are avoided.
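The three preprocessing operations named above can be sketched as follows. The helper `preprocess`, the median fill, and the fixed clipping bounds are assumptions chosen for illustration; the source does not state which strategies are used.

```python
import statistics
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def preprocess(rows: List[Row], column: str, lower: float, upper: float) -> List[Row]:
    """Drop exact duplicate rows, fill missing values with the median, clip outliers."""
    # 1. Duplicate value processing: keep the first occurrence of each row.
    seen = set()
    unique: List[Row] = []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Missing value processing: use the median of the observed values.
    present = [r[column] for r in unique if r[column] is not None]
    median = statistics.median(present)

    # 3. Outlier processing: clip values into the range [lower, upper].
    for r in unique:
        value = median if r[column] is None else r[column]
        r[column] = max(lower, min(upper, value))
    return unique


raw = [
    {"id": 1.0, "income": 3000.0},
    {"id": 1.0, "income": 3000.0},   # exact duplicate
    {"id": 2.0, "income": None},     # missing value
    {"id": 3.0, "income": 5000.0},
    {"id": 4.0, "income": 9e9},      # implausible outlier
]
cleaned = preprocess(raw, "income", lower=0.0, upper=1_000_000.0)
print([r["income"] for r in cleaned])
```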
FIG. 2 is a flowchart illustrating a training method of a user portrait generation module according to an exemplary embodiment. The execution subject of this method may be the same as or different from that of the method shown in Fig. 1, which is not limited by the present disclosure. For example, the execution subject of the method shown in Fig. 1 may be a user terminal and the execution subject of the training method may be a server; that is, after training the user portrait generation module, the server sends it to the user terminal, so that the user terminal executes the steps of the method shown in Fig. 1 using the module.
As shown in fig. 2, the method includes:
S201, obtaining a sample data set, wherein the sample data set comprises sample user data and a pre-labeled sample user label corresponding to the sample user data.
S202, performing binning on the sample user data based on a histogram algorithm to obtain sample binned data.
S203, constructing the tree model according to the sample binned data and the sample user label.
S204, inputting the sample binned data into the tree model to obtain sample prediction features.
S205, constructing the logistic regression model according to the sample prediction features and the sample user label to obtain a trained user portrait generation module.
In step S205, the logistic regression model may be constructed using a loss with L1 regularization.
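As a rough illustration of step S205, the sketch below fits an L1-regularized logistic regression on one-hot leaf features using proximal gradient descent. The solver choice, the function names, the hyperparameter values, and the toy data are all assumptions; the source does not say how the model is optimized.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(w: float, t: float) -> float:
    """Proximal operator of t * |w|: shrink toward zero and clamp at zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X: List[List[float]], y: List[int], lam: float = 0.05,
                    lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Minimize mean log-loss + lam * ||w||_1 (the intercept is not penalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the smooth loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy one-hot features from two hypothetical trees with two prediction nodes each.
# Columns 0-1 (tree A) determine the label; columns 2-3 (tree B) carry no signal.
X = [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]]
y = [1, 1, 0, 0]
weights, intercept = fit_l1_logistic(X, y)
print(weights, intercept)
```

In this toy setting the weights on the two uninformative columns stay at zero, which is the sparsity effect usually sought from L1 regularization.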
In some possible embodiments, the method further comprises: obtaining a verification sample set, which may comprise verification user data and a pre-labeled verification user label corresponding to the verification user data; and, after step S205, verifying the user portrait generation module according to the verification user data and the verification user label, and adjusting parameters in the user portrait generation module according to the verification result, so as to obtain a verified user portrait generation module. Alternatively, in the case that the verification result does not satisfy a preset condition, S201 to S205 are re-executed and verification is performed again until the verification result satisfies the preset condition.
In some optional embodiments, said constructing said tree model from said sample binning data and said sample user tags comprises:
according to first sample binning data and a first sample user tag corresponding to the first sample binning data, constructing a first regression tree in the tree model, and determining a first residual error of the first regression tree;
and performing multiple iterations according to the sample binning data, the sample user label and the first residual error, and constructing other regression trees in the tree model to obtain the tree model.
It is to be understood that the first sample binned data may be the sample set in any bin of the sample binned data, and the first sample user label is the label corresponding to that sample set. Alternatively, each bin in the binned data may be numbered, and the first sample binned data may be taken to be the binned data corresponding to the bin numbered 1.
The first residual may include a residual corresponding to each sample, or may be a residual corresponding to any one sample, where a residual is the prediction error of the first regression tree for the corresponding sample. The residual is then used as the fitting target to construct a second regression tree, the residual of the second regression tree is used as the fitting target to construct a third regression tree, and so on.
Specifically, the performing multiple iterations according to the sample binning data, the sample user label, and the first residual error, and constructing other regression trees in the tree model to obtain the tree model includes:
repeating the step of constructing and obtaining the regression tree of the next iteration cycle according to the second sample binning data corresponding to the next iteration cycle, the second sample user label corresponding to the second sample binning data and the residual error of the regression tree corresponding to the previous iteration cycle;
and stopping iteration under the condition of meeting a preset iteration stopping condition to obtain the tree model.
It can be understood that the Nth regression tree in the tree model is constructed according to the Nth set of sample binned data, the user labels corresponding to that sample binned data, and the residual of the (N-1)th regression tree.
With this scheme, the first regression tree is constructed, and the multiple regression trees are then built iteratively based at least on the first residual of the first regression tree, so that the residual is corrected over multiple iterations, the overall residual of the tree model is reduced, and the accuracy of the user labels generated by the user portrait generation module is improved.
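The residual-fitting iteration described above can be sketched with a deliberately simplified model. The sketch assumes depth-1 regression trees (stumps) on binned features, a squared-error residual, and every tree trained on the full sample set; the names `fit_stump`, `predict`, and `boost` are hypothetical, and a real LightGBM model fits gradients of its own loss rather than this toy procedure.

```python
from typing import List, Tuple

# (feature index, bin threshold, value if bin <= threshold, value otherwise)
Stump = Tuple[int, int, float, float]


def fit_stump(X: List[List[int]], target: List[float]) -> Stump:
    """Pick the (feature, bin threshold) split that minimizes squared error."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        # Every distinct bin except the largest is a valid threshold.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [r for row, r in zip(X, target) if row[f] <= t]
            right = [r for row, r in zip(X, target) if row[f] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (f, t, lv, rv), err
    if best is None:  # no feature can be split: predict the mean everywhere
        mean = sum(target) / len(target)
        return (0, X[0][0], mean, mean)
    return best


def predict_stump(stump: Stump, row: List[int]) -> float:
    f, t, lv, rv = stump
    return lv if row[f] <= t else rv


def predict(trees: List[Stump], row: List[int]) -> float:
    """The ensemble prediction is the sum of the individual tree outputs."""
    return sum(predict_stump(s, row) for s in trees)


def boost(X: List[List[int]], y: List[int], n_trees: int) -> Tuple[List[Stump], List[float]]:
    """Fit the first tree to the labels and each later tree to the remaining residuals."""
    trees: List[Stump] = []
    residuals = [float(v) for v in y]
    sse_history: List[float] = []
    for _ in range(n_trees):
        tree = fit_stump(X, residuals)
        trees.append(tree)
        residuals = [r - predict_stump(tree, row) for row, r in zip(X, residuals)]
        sse_history.append(sum(r * r for r in residuals))
    return trees, sse_history


# Two binned features; the label is 1 only when both features fall in bin 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
trees, history = boost(X, y, n_trees=3)
print(history)
```

The recorded sum of squared residuals never increases from one tree to the next, which is the residual-correction behaviour the paragraph above describes.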
In other optional embodiments, the method comprises:
determining that the preset iteration stopping condition is met under the condition that the number of regression trees in the tree model reaches a preset number threshold; and/or,
and determining that the preset iteration stopping condition is met under the condition that the residual error of the regression tree in the current iteration period is less than or equal to a preset residual error threshold value.
The preset number threshold may be 100, the preset residual threshold may be 0, the disclosure does not limit the specific numerical value thereof, and related technical personnel may calibrate according to actual requirements thereof.
That is, iteration may be stopped and the corresponding tree model obtained when 100 regression trees have been constructed, or when the residual of the Nth regression tree is 0.
With this scheme, setting the preset number threshold and/or the preset residual threshold effectively constrains model training: it avoids an excessive amount of computation in constructing the user portrait caused by too many regression trees in the tree model, and it avoids wasting computing power on endless iteration when the tree model fails to converge.
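The stopping rule can be expressed as a small predicate. The function name `should_stop` and the use of a single scalar residual (for example, a mean absolute residual) are assumptions; the defaults of 100 and 0 are the example values given above.

```python
def should_stop(n_trees: int, residual: float,
                max_trees: int = 100, residual_threshold: float = 0.0) -> bool:
    """Return True when either preset stopping condition is met.

    n_trees is the number of regression trees built so far, and residual is a
    scalar summary of the current tree's residual (an assumption for this sketch).
    """
    reached_tree_limit = n_trees >= max_trees
    residual_small_enough = abs(residual) <= residual_threshold
    return reached_tree_limit or residual_small_enough


print(should_stop(100, 0.3))  # tree limit reached
print(should_stop(5, 0.0))    # residual threshold reached
print(should_stop(5, 0.3))    # neither condition met
```

An embodiment that uses only one of the two conditions can be obtained by passing a very large `max_trees` or a negative `residual_threshold`.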
Based on the above inventive concept, the present disclosure also provides, according to an exemplary embodiment, another flowchart of a training method of a user portrait generation module, as shown in Fig. 3. The user portrait generation module may be as shown in Fig. 4. Referring to Fig. 4, the user portrait generation module includes a tree model and a logistic regression model, and the tree model further includes a histogram algorithm and a one-hot encoding algorithm. The tree model may be a LightGBM model and is composed of a plurality of regression trees, namely regression trees A to N. Each regression tree may include at least one prediction node below a root node; in Fig. 4, prediction nodes are represented by circles. In other embodiments, the histogram algorithm may be built into the LightGBM model.
As shown in fig. 3, the method includes:
s301, acquiring user original data and a label corresponding to the user original data.
S302, preprocessing and characteristic engineering are carried out on the original data of the user, and original derived characteristics are obtained and used as a sample data set.
Wherein the preprocessing comprises one or more of repeated value processing, missing value processing, and outlier processing; the feature engineering includes one or more of a data summarization process, a data statistics process, and a digital label encoding process.
S303, dividing the sample data set into a training set and a test set.
The split may be, for example, seven tenths for the training set and three tenths for the test set.
S304, performing binning on the sample set based on a histogram algorithm to obtain sample binning data.
S305, constructing a tree model according to the sample binning data and the corresponding labels.
S306, performing one-hot encoding on the prediction results of the tree model to obtain combined coding features.
S307, constructing a logistic regression model according to the combined coding features and the labels corresponding to the sample binning data.
The logistic regression model may be constructed with an L1-regularized loss.
And S308, testing the tree model and the logistic regression model according to the test set, adjusting according to the test result, and combining the tree model and the logistic regression model to obtain the trained user portrait generation module.
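The seven-tenths / three-tenths division in S303 can be sketched as follows. The shuffling, the fixed seed and the function name `split_train_test` are assumptions for illustration; the disclosure does not state how samples are assigned to the two sets.

```python
import random


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = split_train_test(range(10))
print(len(train_set), len(test_set))
```

The test set is used only in S308, so that the tree model and the logistic regression model are evaluated on samples they were not fitted to.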
The user portrait generation module obtained in this way has the following advantages:
In the composite model of the user portrait generation module, which is composed of a plurality of regression trees, each tree is generated by a standard regression tree generation process. Each node split is therefore an optimization step of feature selection, and the final prediction node carries the cross information of the sample's features along the multi-layer nodes. The nonlinear relations among the features can thus be mined, and the problems of feature selection and feature cross-combination are solved efficiently.
With the histogram algorithm, the optimal split point is found by traversing the discrete values of the feature histogram. This gives strong partitioning of continuous features and strong generalization, helps the logistic regression model handle continuous features, is more efficient than traditional manual binning, and avoids the arbitrariness of manual operation.
The prediction results of the tree model are encoded and then input into the regularized logistic regression model, so the user portrait generation model not only mines the information of nonlinear features but also retains the interpretability of a linear model, and its prediction performance is better than that of a single logistic regression model.
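The histogram-based split search mentioned above can be sketched as follows. This is an illustrative simplification, not LightGBM's internal code: equal-width bins and a variance-reduction gain are assumed, and the names `build_bin_edges`, `assign_bins` and `best_split` are hypothetical.

```python
from bisect import bisect_left


def build_bin_edges(values, num_bins):
    """Equal-width bin edges over the observed range (an assumed rule)."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    return [low + width * i for i in range(1, num_bins)]


def assign_bins(values, edges):
    """Map each continuous value to the index of its bin."""
    return [bisect_left(edges, v) for v in values]


def best_split(bin_ids, targets, num_bins):
    """Scan bin boundaries and return the one with the largest gain.

    Only per-bin sums and counts are needed, so the scan costs
    O(num_bins) instead of O(number of distinct feature values).
    """
    sums, counts = [0.0] * num_bins, [0] * num_bins
    for b, t in zip(bin_ids, targets):
        sums[b] += t
        counts[b] += 1
    total_sum, total_count = sum(sums), sum(counts)
    best_bin, best_gain = None, 0.0
    left_sum, left_count = 0.0, 0
    for b in range(num_bins - 1):
        left_sum += sums[b]
        left_count += counts[b]
        right_sum = total_sum - left_sum
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        # Reduction in squared error obtained by splitting after bin b.
        gain = (left_sum ** 2 / left_count + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


values = [1, 2, 3, 4, 10, 11, 12, 13]
targets = [0, 0, 0, 0, 1, 1, 1, 1]
edges = build_bin_edges(values, 4)
print(best_split(assign_bins(values, edges), targets, 4))
```

Because the candidate split points are the bin boundaries rather than every raw value, the continuous feature is discretized automatically, which is the role manual binning would otherwise play.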
FIG. 5 is a block diagram illustrating a user representation generation apparatus in accordance with an exemplary embodiment, and as shown in FIG. 5, the user representation generation apparatus 50 includes:
an obtaining module 51, configured to obtain user data of a target user, where the user data includes transaction data of the target user;
a binning module 52, configured to perform binning processing on the user data based on a histogram algorithm to obtain binning data;
a prediction module 53, configured to input the binning data into a tree model composed of multiple regression trees in a user portrait generation module obtained through pre-training, so as to obtain a prediction feature;
a generating module 54 for inputting the predicted features into a logistic regression model in the user representation generating module to obtain a user label corresponding to the target user, so as to generate a user representation capable of at least representing user credit information.
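The way the four modules hand data to one another can be sketched as a small pipeline. Every callable below is a toy stand-in supplied for illustration; the class name `UserPortraitGenerator`, its field names and the output dictionary are assumptions, not part of the disclosed apparatus.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class UserPortraitGenerator:
    acquire: Callable[[str], Sequence[float]]         # obtaining module 51
    bin_data: Callable[[Sequence[float]], Sequence[int]]  # binning module 52
    predict: Callable[[Sequence[int]], Sequence[int]]     # prediction module 53
    generate: Callable[[Sequence[int]], str]              # generating module 54

    def run(self, user_id: str) -> dict:
        user_data = self.acquire(user_id)
        binned = self.bin_data(user_data)
        features = self.predict(binned)
        label = self.generate(features)
        return {"user_id": user_id, "credit_label": label}


# Toy stand-ins: real modules would wrap the trained tree model and the
# logistic regression model.
generator = UserPortraitGenerator(
    acquire=lambda user_id: [120.0, 35.5],
    bin_data=lambda data: [int(v // 50) for v in data],
    predict=lambda bins: [1 if b >= 2 else 0 for b in bins],
    generate=lambda features: "good" if sum(features) else "review",
)
print(generator.run("u1"))
```

Keeping each module behind a plain callable mirrors the logical division of the apparatus: any module can be replaced without changing the others.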
Optionally, the prediction module 53 is configured to:
inputting the binning data into each regression tree in the tree model to obtain a prediction sub-feature corresponding to each regression tree;
and performing combined coding on the prediction sub-features to obtain combined coding features as the prediction features.
Optionally, each regression tree in the plurality of regression trees includes one or more prediction nodes;
the prediction module 53 is further configured to:
and carrying out one-hot coding according to the predicted value of each predicted node in each regression tree to obtain the combined coding feature.
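One common way to realize this combined coding is to one-hot encode, for each regression tree, the prediction node (leaf) that a sample falls into and to concatenate the per-tree vectors. The sketch below assumes that reading; the nested-dictionary tree layout and the names `leaf_index` and `encode_sample` are hypothetical.

```python
def leaf_index(tree, sample):
    """Walk from the root to the prediction node that the sample reaches."""
    node = tree
    while "leaf" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


def encode_sample(trees, leaf_counts, sample):
    """Concatenate one one-hot vector per tree into the combined feature."""
    encoded = []
    for tree, leaf_count in zip(trees, leaf_counts):
        one_hot = [0] * leaf_count
        one_hot[leaf_index(tree, sample)] = 1
        encoded.extend(one_hot)
    return encoded


# Two hand-written trees over binned features (illustrative only).
tree_a = {
    "feature": 0, "threshold": 2,
    "left": {"leaf": 0},
    "right": {"feature": 1, "threshold": 5,
              "left": {"leaf": 1}, "right": {"leaf": 2}},
}
tree_b = {"feature": 1, "threshold": 3,
          "left": {"leaf": 0}, "right": {"leaf": 1}}

print(encode_sample([tree_a, tree_b], [3, 2], [3, 4]))
```

Each position in the encoded vector corresponds to one root-to-leaf path, so a non-zero weight on that position in the downstream linear model can be read as the effect of that feature combination.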
Optionally, the user representation generating device 50 is configured to:
preprocessing original user data corresponding to a target user to obtain preprocessed first user data, wherein the preprocessing operation comprises one or more of repeated value processing, missing value processing and abnormal value processing;
and performing feature derivation and feature engineering on the first user data to obtain the user data of the target user, wherein the feature derivation and feature engineering comprise one or more of data summarization processing, data statistics processing and digital tag coding processing.
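The preprocessing and feature derivation steps can be sketched as follows. The concrete choices (exact-duplicate removal, median filling, clipping to fixed bounds, count/total/mean statistics, integer codes for a categorical field) and the names `preprocess` and `derive_features` are assumptions; the disclosure only names the categories of processing.

```python
from statistics import mean, median


def preprocess(records, field, lower, upper):
    """Drop exact duplicates, fill missing values, clip outliers."""
    unique, seen = [], set()
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:  # repeated value processing
            seen.add(key)
            unique.append(dict(record))
    fill_value = median([r[field] for r in unique if r[field] is not None])
    for record in unique:
        if record[field] is None:  # missing value processing
            record[field] = fill_value
        # abnormal value processing: clip to the allowed range
        record[field] = min(max(record[field], lower), upper)
    return unique


def derive_features(records, field, category_field):
    """Summaries, statistics and numeric label codes for one user."""
    amounts = [r[field] for r in records]
    categories = sorted({r[category_field] for r in records})
    codes = {name: code for code, name in enumerate(categories)}
    return {
        "count": len(amounts),
        "total": sum(amounts),
        "mean": mean(amounts),
        "category_codes": [codes[r[category_field]] for r in records],
    }


raw = [
    {"user": "u1", "amount": 10.0, "channel": "card"},
    {"user": "u1", "amount": 10.0, "channel": "card"},
    {"user": "u1", "amount": None, "channel": "app"},
    {"user": "u1", "amount": 1000.0, "channel": "card"},
]
cleaned = preprocess(raw, "amount", 0.0, 500.0)
print(derive_features(cleaned, "amount", "channel"))
```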
Optionally, the user representation generating device 50 is configured to:
acquiring a sample data set, wherein the sample data set comprises sample user data and a pre-labeled sample user label corresponding to the sample user data;
performing binning on the sample user data based on a histogram algorithm to obtain sample binning data;
constructing the tree model according to the sample binning data and the sample user label;
inputting the sample binning data into the tree model to obtain sample prediction features;
and constructing the logistic regression model according to the sample prediction characteristics and the sample user labels to obtain a trained user portrait generation module.
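The last training step, fitting a logistic regression model on the encoded tree outputs, can be sketched with an L1 penalty as follows. The optimizer (full-batch proximal gradient descent with soft-thresholding), the hyperparameter values and the names `fit_l1_logistic` and `predict_probability` are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(weight, amount):
    """Shrink a weight toward zero; this is what makes L1 produce zeros."""
    if weight > amount:
        return weight - amount
    if weight < -amount:
        return weight + amount
    return 0.0


def predict_probability(weights, bias, features):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, features)))


def fit_l1_logistic(samples, labels, l1=0.01, learning_rate=0.5, epochs=500):
    """Minimize mean log loss plus l1 * sum(|w|) by proximal gradient steps."""
    n, d = len(samples), len(samples[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for features, label in zip(samples, labels):
            error = predict_probability(weights, bias, features) - label
            grad_b += error / n
            for j, value in enumerate(features):
                grad_w[j] += error * value / n
        bias -= learning_rate * grad_b
        weights = [soft_threshold(w - learning_rate * g, learning_rate * l1)
                   for w, g in zip(weights, grad_w)]
    return weights, bias


# One-hot sample prediction features; the third position is never active.
samples = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
labels = [1, 1, 0, 0]
weights, bias = fit_l1_logistic(samples, labels)
print([round(w, 2) for w in weights], round(bias, 2))
```

The L1 term drives weights of uninformative leaf positions to exactly zero, which keeps the linear part of the model compact and easier to interpret.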
Optionally, the user representation generating device 50 is configured to:
according to first sample binning data and a first sample user tag corresponding to the first sample binning data, constructing a first regression tree in the tree model, and determining a first residual error of the first regression tree;
and performing multiple iterations according to the sample binning data, the sample user label and the first residual error, and constructing other regression trees in the tree model to obtain the tree model.
Optionally, the user representation generating device 50 is configured to:
repeating the step of constructing and obtaining the regression tree of the next iteration cycle according to the second sample binning data corresponding to the next iteration cycle, the second sample user label corresponding to the second sample binning data and the residual error of the regression tree corresponding to the previous iteration cycle;
and stopping iteration under the condition of meeting a preset iteration stopping condition to obtain the tree model.
Optionally, the user representation generating device 50 is configured to:
determining that the preset iteration stopping condition is met when the number of regression trees in the tree model reaches a preset number threshold; and/or,
determining that the preset iteration stopping condition is met when the residual error of the regression tree in the current iteration cycle is less than or equal to a preset residual error threshold.
It should be noted that the implementation principle of the user portrait generation device 50 is the same as that of the user portrait generation method described above, and details are not repeated here. It should be understood that the division of the above device into modules is only a logical division; in an actual implementation the modules may be wholly or partially integrated into one physical entity, or may be physically separate. These modules may all be implemented as software invoked by a processing element, may all be implemented in hardware, or some may be implemented as software invoked by a processing element and the rest in hardware. For example, the user portrait generation device 50 may be a separate processing element, may be integrated into a chip of the device, or may be stored in a memory of the device in the form of program code that a processing element of the device calls to execute the functions of the user portrait generation method. The other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described herein may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
An embodiment of the present invention provides a computer device 100, where the computer device 100 includes a processor and a non-volatile memory storing computer instructions, and when the computer instructions are executed by the processor, the computer device 100 executes the user portrait generation method. As shown in fig. 6, fig. 6 is a block diagram of a computer device 100 according to an embodiment of the present invention. Computer device 100 includes user representation generation apparatus 50, memory 111, processor 112, and communication unit 113.
To facilitate the transfer or interaction of data, the memory 111, the processor 112 and the communication unit 113 are electrically connected to each other, directly or indirectly. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The user portrait generation device 50 includes at least one software function module that may be stored in the memory 111 in the form of software or firmware, or built into the Operating System (OS) of the computer device 100. The processor 112 is used to execute the executable modules stored in the memory 111, such as the software function modules and computer programs included in the user portrait generation device 50.
The embodiment of the invention provides a readable storage medium, which comprises a computer program, and when the computer program runs, the computer program controls computer equipment where the readable storage medium is located to execute the user portrait generation method.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical applications, to thereby enable others skilled in the art to best utilize the disclosure and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A user portrait generation method, comprising:
acquiring user data of a target user, wherein the user data comprises transaction data of the target user;
performing binning on the user data based on a histogram algorithm to obtain binning data;
inputting the binning data into a tree model consisting of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features;
and inputting the predicted features into a logistic regression model in the user portrait generation module to obtain a user label corresponding to the target user so as to generate a user portrait at least capable of representing user credit information.
2. The method of claim 1, wherein inputting the binning data into the tree model consisting of a plurality of regression trees in the pre-trained user portrait generation module to obtain the prediction features comprises:
inputting the binning data into each regression tree in the tree model to obtain a prediction sub-feature corresponding to each regression tree;
and performing combined coding on the prediction sub-features to obtain combined coding features as the prediction features.
3. The method of claim 2, wherein each of the plurality of regression trees includes one or more prediction nodes therein;
the performing combined coding on the prediction sub-features to obtain combined coding features as the prediction features comprises:
and carrying out one-hot coding according to the predicted value of each predicted node in each regression tree to obtain the combined coding feature.
4. The method according to any one of claims 1 to 3, wherein the method further comprises:
preprocessing original user data corresponding to a target user to obtain preprocessed first user data, wherein the preprocessing operation comprises one or more of repeated value processing, missing value processing and abnormal value processing;
and performing feature derivation and feature engineering on the first user data to obtain the user data of the target user, wherein the feature derivation and feature engineering comprise one or more of data summarization processing, data statistics processing and digital tag coding processing.
5. The method of claim 1, wherein the user representation generation module is trained in accordance with:
acquiring a sample data set, wherein the sample data set comprises sample user data and a pre-labeled sample user label corresponding to the sample user data;
performing binning on the sample user data based on a histogram algorithm to obtain sample binning data;
constructing the tree model according to the sample binning data and the sample user label;
inputting the sample binning data into the tree model to obtain sample prediction features;
and constructing the logistic regression model according to the sample prediction characteristics and the sample user labels to obtain a trained user portrait generation module.
6. The method of claim 5, wherein the constructing the tree model from the sample bin data and the sample user tags comprises:
constructing a first regression tree in the tree model according to first sample binning data and a first sample user tag corresponding to the first sample binning data, and determining a first residual error of the first regression tree;
and performing multiple iterations according to the sample binning data, the sample user label and the first residual error, and constructing other regression trees in the tree model to obtain the tree model.
7. The method of claim 6, wherein the performing multiple iterations according to the sample binning data, the sample user label and the first residual error, and constructing other regression trees in the tree model to obtain the tree model comprises:
repeating the step of constructing and obtaining the regression tree of the next iteration cycle according to the second sample binning data corresponding to the next iteration cycle, the second sample user label corresponding to the second sample binning data and the residual error of the regression tree corresponding to the previous iteration cycle;
and stopping iteration under the condition of meeting a preset iteration stopping condition to obtain the tree model.
8. The method of claim 7, wherein the method comprises:
determining that the preset iteration stopping condition is met when the number of regression trees in the tree model reaches a preset number threshold; and/or,
and determining that the preset iteration stopping condition is met under the condition that the residual error of the regression tree in the current iteration period is less than or equal to a preset residual error threshold value.
9. A user representation generation apparatus, comprising:
an acquisition module, configured to acquire user data of a target user, wherein the user data comprises transaction data of the target user;
a binning module, configured to perform binning on the user data based on a histogram algorithm to obtain binning data;
a prediction module, configured to input the binning data into a tree model consisting of a plurality of regression trees in a pre-trained user portrait generation module to obtain prediction features;
and the generating module is used for inputting the prediction characteristics into a logistic regression model in the user portrait generating module to obtain a user label corresponding to the target user so as to generate a user portrait at least capable of representing user credit information.
10. A computer device comprising a processor and a non-volatile memory having computer instructions stored thereon, wherein when the computer instructions are executed by the processor, the computer device performs the user representation generation method of any of claims 1 to 8.
CN202211264457.4A 2022-10-17 2022-10-17 User portrait generation method and device and computer equipment Pending CN115329909A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211264457.4A CN115329909A (en) 2022-10-17 2022-10-17 User portrait generation method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211264457.4A CN115329909A (en) 2022-10-17 2022-10-17 User portrait generation method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN115329909A true CN115329909A (en) 2022-11-11

Family

ID=83915425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211264457.4A Pending CN115329909A (en) 2022-10-17 2022-10-17 User portrait generation method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN115329909A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543986A (en) * 2018-11-16 2019-03-29 湖南数定智能科技有限公司 The pre- methods of risk assessment of prison convict three and system based on user's portrait
CN110674178A (en) * 2019-08-30 2020-01-10 阿里巴巴集团控股有限公司 Method and system for constructing user portrait label
CN112232944A (en) * 2020-09-29 2021-01-15 中诚信征信有限公司 Scoring card creating method and device and electronic equipment
CN112256881A (en) * 2020-12-21 2021-01-22 上海冰鉴信息科技有限公司 User information classification method and device
CN112926651A (en) * 2021-02-24 2021-06-08 苏州黑云智能科技有限公司 Enterprise credit assessment method and system
CN114820160A (en) * 2022-03-31 2022-07-29 度小满科技(北京)有限公司 Loan interest rate estimation method, device, equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张丽娟: "基于机器学习的用户画像研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
杨游云: "《数据分析与决策技术丛书 Python广告数据挖掘与分析实战》", 31 March 2021, 机械工业出版社 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221111

RJ01 Rejection of invention patent application after publication