CN110796172A - Sample label determination method and device for financial data and electronic equipment - Google Patents


Info

Publication number
CN110796172A
Authority
CN
China
Prior art keywords
financial data
users
sample set
equation
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910921682.2A
Other languages
Chinese (zh)
Inventor
王鹏
高明宇
张潮华
郑彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qiyu Information Technology Co Ltd
Original Assignee
Beijing Qiyu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qiyu Information Technology Co Ltd filed Critical Beijing Qiyu Information Technology Co Ltd
Priority to CN201910921682.2A priority Critical patent/CN110796172A/en
Publication of CN110796172A publication Critical patent/CN110796172A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a sample label determination method and device for financial data, an electronic device, and a computer-readable medium. The method comprises the following steps: determining multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set; generating a target hypersphere equation from the hypersphere equation and the multi-dimensional financial data characteristic values of the users in the positive sample set; respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of each user; and comparing the user's hypersphere distance with a threshold to determine the user's sample label, the sample label being either a positive sample label or a negative sample label. With this method, positive samples among the unclassified samples can be extracted, and positive and negative samples can be accurately classified, thereby improving the effectiveness and accuracy of the machine learning model.

Description

Sample label determination method and device for financial data and electronic equipment
Technical Field
The present disclosure relates to the field of computer information processing, and in particular, to a method and an apparatus for determining a sample tag of financial data, an electronic device, and a computer-readable medium.
Background
Machine learning has developed rapidly across many areas of artificial intelligence research. Common machine learning models can be classified into supervised learning, unsupervised learning, and reinforcement learning, and each category can be further divided into different algorithms. In most current application scenarios, a machine learning model suited to the problem at hand is readily available. In typical use, a user first selects a machine learning model of a certain category or algorithm and then supplies data specific to the problem to be solved; the model is trained on that data, and the result is a machine learning model suited to that specific task. In general, even when the same algorithm is used, machine learning models trained on different data are completely different.
In general, a machine learning model needs to learn from both positive samples and negative samples. A positive sample is a sample belonging to the category we want to identify correctly. For example, to determine whether a picture shows a car, pictures of cars are the positive samples during training, and in principle any picture that is not of a car can serve as a negative sample. In the financial field and some other fields, however, positive samples are easy to select while negative samples are not: one often has only positive samples and unlabeled samples. For example, when looking for potential default customers, it is known only which users have already defaulted (positive samples); for the large number of remaining users (unlabeled), it is currently unclear whether they are potential defaulters. If the unlabeled samples are used directly as negative samples to train a binary classification model, the large number of positive samples among them introduces a large amount of erroneous data, and the final trained model may perform poorly.
Therefore, there is a need for a new method, apparatus, electronic device, and computer readable medium for sample tag determination of financial data.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present disclosure provides a method, an apparatus, an electronic device, and a computer readable medium for determining a sample tag of financial data, which can extract a positive sample from an unclassified sample and accurately classify the positive sample and a negative sample, thereby improving a calculation effect and a calculation accuracy of a machine learning model.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, a method for determining a sample tag of financial data is provided, the method including: determining multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set, the positive sample set and the unclassified sample set each comprising financial data of the plurality of users; generating a target hypersphere equation from the hypersphere equation and the multi-dimensional financial data characteristic values of the users in the positive sample set; respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of each user; and comparing the user's hypersphere distance with a threshold to determine the user's sample label, the sample label being either a positive sample label or a negative sample label.
Optionally, determining the multi-dimensional financial data characteristic values of the plurality of users in the positive sample set and the unclassified sample set comprises: obtaining financial data for a plurality of users in the positive sample set and the unclassified sample set; and comparing the financial data of the users with a preset multi-dimensional characteristic vector reference value to generate the multi-dimensional financial data characteristic values of the users.
Optionally, the comparing the financial data of the plurality of users with a preset multi-dimensional feature vector reference value to generate the multi-dimensional financial data feature values of the plurality of users includes: classifying the financial data of the users according to preset dimensionality to generate multi-dimensional financial data; and comparing the multi-dimensional financial data with a preset multi-dimensional characteristic vector reference value to generate multi-dimensional financial data characteristic values of a plurality of users.
Optionally, the method further comprises: determining a characteristic dimension of the financial data of the user; and determining the multi-dimensional feature vector reference value based on the numerical values of the financial data of a plurality of users of which the feature dimensions correspond to the dimensions.
Optionally, the generating a target hypersphere equation by the hypersphere equation and the multidimensional financial data feature values of the plurality of users in the positive sample set comprises: performing parameter training on the hypersphere equation through multi-dimensional financial data characteristic values of a plurality of users in the positive sample set to generate a target hypersphere equation.
Optionally, the parameter training of the hypersphere equation by the multidimensional financial data feature values of the plurality of users in the positive sample set to generate a target hypersphere equation comprises: constructing an initial hypersphere equation based on the hypersphere equation and multi-dimensional financial data characteristic values of a plurality of users in the positive sample set; determining a target relaxation variable threshold and an optimization target; solving the initial hyper-sphere equation through an optimization algorithm based on the target relaxation variable threshold and the optimization target to obtain an optimal solution of the initial hyper-sphere equation; and generating the target hypersphere equation according to the parameter of the hypersphere equation corresponding to the optimal solution.
Optionally, the optimization objective is: distances between the multi-dimensional financial data characteristic values of the plurality of users and the center point of the hyper-sphere equation are smaller than the radius of the hyper-sphere equation.
Optionally, respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of the user includes: respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation; and solving the target hypersphere equation using the Lagrangian dual method to obtain the hypersphere distance of the user.
Optionally, comparing the user's hyper-sphere distance to a threshold to determine the user's sample label comprises: when the distance of the hyper-sphere of the user is larger than the threshold value, determining that the sample label of the user is a negative sample label; and when the distance of the hyper-sphere of the user is smaller than or equal to the threshold value, determining that the sample label of the user is a positive sample label.
Optionally, the method further comprises: generating positive sample financial data from the positive sample set and the user financial data with the positive sample label; generating negative sample financial data through the user financial data with the negative sample label; and training a machine learning model with the positive sample financial data and the negative sample financial data to generate a user breach risk model.
According to an aspect of the present disclosure, there is provided a sample tag determination apparatus for financial data, the apparatus including: a feature module for determining multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set, the positive sample set and the unclassified sample set each comprising financial data of the plurality of users; an equation module for generating a target hypersphere equation from the hypersphere equation and the multi-dimensional financial data characteristic values of the users in the positive sample set; a distance module for respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of each user; and a label module for comparing the hypersphere distance of the user with a threshold to determine a sample label for the user, the sample label comprising a positive sample label and a negative sample label.
Optionally, the feature module comprises: a data unit for obtaining financial data of a plurality of users in the positive sample set and the unclassified sample set; and the comparison unit is used for comparing the financial data of the users with a preset multi-dimensional characteristic vector reference value to generate the multi-dimensional financial data characteristic values of the users.
Optionally, the comparison unit includes: a classification subunit for classifying the financial data of the users according to preset dimensions to generate multi-dimensional financial data; and a comparison subunit for comparing the multi-dimensional financial data with a preset multi-dimensional feature vector reference value to generate the multi-dimensional financial data characteristic values of the plurality of users.
Optionally, the comparison unit further comprises: a dimension subunit for determining a characteristic dimension of the financial data of the user; and a reference subunit for determining the multi-dimensional feature vector reference value based on the values of the financial data of the plurality of users in the dimension corresponding to the characteristic dimension.
Optionally, the equation module is further configured to perform parameter training on the hypersphere equation through multi-dimensional financial data feature values of a plurality of users in the positive sample set to generate a target hypersphere equation.
Optionally, the equation module comprises: the constructing unit is used for constructing an initial hypersphere equation based on the hypersphere equation and multi-dimensional financial data characteristic values of a plurality of users in the positive sample set; the parameter unit is used for determining a target relaxation variable threshold and an optimization target; the solving unit is used for solving the initial hypersphere equation through an optimization algorithm based on the target relaxation variable threshold and the optimization target to obtain the optimal solution of the initial hypersphere equation; and the equation unit is used for generating the target hyper-sphere equation through the parameters of the hyper-sphere equation corresponding to the optimal solution.
Optionally, the optimization objective is: distances between the multi-dimensional financial data characteristic values of the plurality of users and the center point of the hyper-sphere equation are smaller than the radius of the hyper-sphere equation.
Optionally, the distance module comprises: the substituting unit is used for respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation; and the distance unit is used for solving the target hypersphere equation by adopting a Lagrange dual method so as to obtain the hypersphere distance of the user.
Optionally, the label module comprises: a negative sample unit for determining that the sample label of the user is a negative sample label when the hypersphere distance of the user is greater than the threshold; and a positive sample unit for determining that the sample label of the user is a positive sample label when the hypersphere distance of the user is less than or equal to the threshold.
Optionally, the apparatus further comprises: a data module for generating positive sample financial data from the positive sample set and the user financial data with the positive sample label, and for generating negative sample financial data from the user financial data with the negative sample label; and a training module for training a machine learning model with the positive sample financial data and the negative sample financial data to generate a user default risk model.
According to an aspect of the present disclosure, an electronic device is provided, the electronic device including: one or more processors; and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to an aspect of the disclosure, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.
According to the financial data sample label determination method, the financial data sample label determination device, the electronic equipment and the computer readable medium, multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set are determined; generating a target hypersphere equation through the multi-dimensional financial data characteristic values of the users in the positive sample set and the hypersphere equation; respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of the users; and comparing the distance of the hyper-sphere of the user with a threshold value to determine a sample label of the user, so that a positive sample in an unclassified sample can be extracted, and the positive sample and a negative sample can be accurately classified, thereby improving the calculation effect and the calculation precision of a machine learning model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
FIG. 1 is a flow diagram illustrating a sample tag determination method for financial data in accordance with an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a sample tag determination method for financial data according to another exemplary embodiment.
FIG. 3 is a flow chart illustrating a method for sample tag determination of financial data according to another exemplary embodiment.
FIG. 4 is a flow chart illustrating a method for sample tag determination of financial data in accordance with another exemplary embodiment.
FIG. 5 is a block diagram illustrating a sample tag determination apparatus for financial data in accordance with an exemplary embodiment.
Fig. 6 is a block diagram illustrating a sample tag determination apparatus for financial data according to another exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 8 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
FIG. 1 is a flow diagram illustrating a sample tag determination method for financial data in accordance with an exemplary embodiment. The sample tag determination method 10 for financial data includes at least steps S102 to S108.
As shown in fig. 1, in S102, multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set, each of which includes financial data of the plurality of users, are determined.
The sample data may be financial data of a user of a financial service company, further, financial data of a user having a default record may be included in the positive sample set, and financial data of a user having no default record may be included in the unclassified sample set.
In one embodiment, the positive sample set may contain financial data of users who have made loan records, and the unclassified sample set may contain financial data of users who have not made loan records.
In other embodiments, the positive sample set may instead include financial data of users with some other confirmed financial characteristic, and the unclassified sample set may include financial data of users for whom that financial characteristic has not yet occurred; this application is not limited in this respect.
In one embodiment, determining multi-dimensional financial data feature values for a plurality of users in the positive sample set and the unclassified sample set comprises: obtaining financial data for a plurality of users in the positive sample set and the unclassified sample set; and comparing the financial data of the users with a preset multi-dimensional characteristic vector reference value to generate the multi-dimensional financial data characteristic values of the users.
In one embodiment, determining the multi-dimensional financial data characteristic values for the plurality of users in the positive sample set and the unclassified sample set further comprises: determining a characteristic dimension of the financial data of the user; and determining the multi-dimensional feature vector reference value based on the numerical values of the financial data of a plurality of users of which the feature dimensions correspond to the dimensions.
The details of "determining the multi-dimensional financial data characteristic values of a plurality of users in the positive sample set and the unclassified sample set" are set forth in the corresponding embodiment of fig. 3.
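The text does not spell out at this point what "comparing" with the reference value means. As a minimal sketch under stated assumptions, the reference value for each dimension is taken here to be the mean of that dimension across all users, and the characteristic value is the ratio of a user's raw value to that mean. The function names, the two sample dimensions, and the ratio rule are all illustrative assumptions, not taken from the disclosure.

```python
def reference_values(rows):
    """Per-dimension mean across all users (assumed definition of the reference value)."""
    n_users = len(rows)
    n_dims = len(rows[0])
    return [sum(row[d] for row in rows) / n_users for d in range(n_dims)]


def feature_values(rows, reference):
    """Compare each raw value with its reference value (assumed rule: simple ratio)."""
    return [[value / ref for value, ref in zip(row, reference)] for row in rows]


# Hypothetical raw data: one row per user, columns = (monthly income, outstanding balance).
raw = [[4000.0, 1000.0], [6000.0, 3000.0], [8000.0, 2000.0]]
ref = reference_values(raw)
features = feature_values(raw, ref)
```

Any other comparison rule, such as a difference or a standardized score, would fit the same two-step shape: derive the reference from the population, then map each user's raw values against it.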
In S104, a target hypersphere equation is generated by the hypersphere equation and the multidimensional financial data feature values of the plurality of users in the positive sample set. The hypersphere equation may be parametrically trained, for example, by multidimensional financial data feature values of a plurality of users in the positive sample set to generate a target hypersphere equation.
A hypersphere, also called an n-dimensional sphere, is the generalization of the ordinary sphere to arbitrary dimension. It is an n-dimensional manifold in (n+1)-dimensional space. Specifically, the 0-dimensional sphere is a pair of points on a straight line, the 1-dimensional sphere is a circle in a plane, and the 2-dimensional sphere is the ordinary sphere in three-dimensional space. A sphere of more than 2 dimensions is called a hypersphere.
More specifically, the parameters of a hypersphere can be defined as follows. An n-dimensional sphere centered at the origin with a radius of unit length is called the unit n-dimensional sphere and is denoted $S^n$. In symbols:
$$S^n = \{\, x \in \mathbb{R}^{n+1} : \|x\| = 1 \,\}$$
The n-dimensional sphere is the surface, or boundary, of an (n+1)-dimensional ball, and is one kind of n-dimensional manifold.
For any natural number n, an n-dimensional sphere of radius r is defined as the set of all points in (n +1) -dimensional Euclidean space whose distance to a certain fixed point is equal to a constant r, where r can be any positive real number. It is an n-dimensional manifold in an (n +1) -dimensional space.
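In symbols, this general definition can be written as follows. The name c for the fixed point and the notation with a subscript radius are introduced here only for illustration; the unit sphere is the special case with c at the origin and r = 1.

```latex
S^n_r(c) = \{\, x \in \mathbb{R}^{n+1} : \|x - c\| = r \,\}, \qquad r > 0,
\qquad S^n = S^n_1(0).
```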
The details of "generating a target hypersphere equation by using the multidimensional financial data feature values and the hypersphere equations of a plurality of users in the positive sample set" will be described in the corresponding embodiment of fig. 4. The target hypersphere is the hypersphere whose parameters, such as its radius, have been determined for the space of the given dimension.
In S106, the multi-dimensional financial data feature values of the users in the unclassified sample set are respectively substituted into the target hypersphere equation to obtain the hypersphere distance of each user. This may include: respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation; and solving the target hypersphere equation using the Lagrangian dual method to obtain the hypersphere distance of the user.
As shown in fig. 2, this can be understood simply as finding a hypersphere (the target hypersphere) that encloses the positive samples and then using that hypersphere for prediction: a sample that falls inside the target hypersphere is considered a positive sample.
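As a minimal sketch of the substitution step, assume the target hypersphere is represented by a center vector and that the "hypersphere distance" is the Euclidean distance from a user's feature vector to that center. The function name and the sample values are hypothetical.

```python
import math


def hypersphere_distance(x, center):
    """Euclidean distance from feature vector x to the hypersphere center."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))


center = [1.0, 1.0]                      # assumed center of a fitted target hypersphere
unclassified = [[1.0, 2.0], [4.0, 5.0]]  # hypothetical feature vectors of two users
distances = [hypersphere_distance(x, center) for x in unclassified]
```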
Assume the hypersphere to be generated has center o and radius r > 0. The hypersphere volume V(r) is minimized, and the center o is a linear combination of the support vectors. The distance from every data point $x_i$ representing user features in the unclassified sample set to the center may be required to be strictly less than r. At the same time, slack variables $\zeta_i$ with penalty coefficient C are constructed, giving the following optimization problem:
$$\min \; V(r) + C \sum_{i=1}^{m} \zeta_i$$
$$\text{subject to} \quad \|x_i - o\| \le r + \zeta_i, \quad i = 1, 2, 3, \ldots, m;$$
$$\zeta_i \ge 0, \quad i = 1, 2, 3, \ldots, m.$$
The problem is solved using Lagrangian duality, and the distance from a data point z to the center o is then computed.
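The disclosure does not name a specific solver. The following is a minimal sketch under simplifying assumptions that are not from the source: the slack variables are dropped (the hard-margin case, as when C is large), the objective is taken as the squared radius, and no kernel is used. Under those assumptions the Lagrangian dual is to maximize $\sum_i \alpha_i \|x_i\|^2 - \|\sum_i \alpha_i x_i\|^2$ over weights $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$, and the center is $o = \sum_i \alpha_i x_i$. The sketch uses Frank-Wolfe iterations on this dual rather than a general quadratic programming solver; the function name and sample data are hypothetical.

```python
import math


def fit_hypersphere(points, n_iter=2000):
    """Approximate the smallest enclosing hypersphere of `points`.

    Frank-Wolfe iterations on the dual problem: the weights alpha stay
    non-negative and sum to 1, and the center is the alpha-weighted
    combination of the points.
    """
    n_dims = len(points[0])
    alpha = [0.0] * len(points)
    alpha[0] = 1.0  # start with the center on the first point

    def center_of(weights):
        return [sum(w * p[d] for w, p in zip(weights, points)) for d in range(n_dims)]

    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    for k in range(1, n_iter + 1):
        center = center_of(alpha)
        # The dual gradient is largest for the point farthest from the center.
        far = max(range(len(points)), key=lambda i: sq_dist(points[i], center))
        step = 1.0 / (k + 1)
        alpha = [(1.0 - step) * a for a in alpha]
        alpha[far] += step

    center = center_of(alpha)
    radius = math.sqrt(max(sq_dist(p, center) for p in points))
    return center, radius, alpha


# Hypothetical positive-sample feature vectors: the four corners of a square.
positives = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
center, radius, alpha = fit_hypersphere(positives)
```

Points that end up with non-zero weight play the role of the support vectors. Reintroducing the slack variables would cap each weight at C, which this sketch does not handle.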
In S108, the hypersphere distance of the user is compared with a threshold to determine a sample label for the user, the sample label being either a positive sample label or a negative sample label. This may include: when the hypersphere distance of the user is greater than the threshold, determining that the sample label of the user is a negative sample label; and when the hypersphere distance of the user is less than or equal to the threshold, determining that the sample label of the user is a positive sample label.
Based on this result, it is judged whether a data point z in the unlabeled set lies inside the target hypersphere, which determines the sample label of the user. If the distance from z to the center is less than or equal to the radius r, z is not an outlier and is marked as a positive sample; if the distance is greater than r, z lies outside the hypersphere and is a negative sample.
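The comparison rule can be sketched as follows, assuming the threshold is the radius of the target hypersphere and the distances have already been computed. The function name, the string labels, and the sample values are hypothetical.

```python
def label_by_distance(distance, threshold):
    """Return 'positive' if the point lies inside or on the hypersphere, else 'negative'."""
    return "positive" if distance <= threshold else "negative"


radius = 2.0                  # assumed threshold: radius of the target hypersphere
distances = [1.0, 2.0, 5.0]   # hypothetical hypersphere distances of three users
labels = [label_by_distance(d, radius) for d in distances]
```

A distance exactly equal to the threshold is labeled positive, matching the "less than or equal to" condition in the text.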
In one embodiment, further comprising generating positive sample financial data from the positive sample set and the user financial data with the positive sample label; generating negative sample financial data through the user financial data with the negative sample label; and training a machine learning model with the positive sample financial data and the negative sample financial data to generate a user breach risk model.
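One way to picture how the two data sets are assembled before training is the sketch below. The function name `build_training_set` and the 1/0 target encoding are assumptions made for illustration; the disclosure does not specify the model or the encoding.

```python
def build_training_set(positive_set, labeled_users):
    """Combine known positives with newly labeled users into (features, targets).

    positive_set: list of feature vectors already known to be positive.
    labeled_users: list of (feature_vector, label) pairs, where label is
    'positive' or 'negative' as assigned by the hypersphere comparison.
    """
    features, targets = [], []
    for x in positive_set:
        features.append(x)
        targets.append(1)
    for x, label in labeled_users:
        features.append(x)
        targets.append(1 if label == "positive" else 0)
    return features, targets

# Hypothetical feature vectors.
known_positive = [[0.1, 0.2], [0.0, 0.3]]
newly_labeled = [([0.2, 0.1], "positive"), ([3.0, 4.0], "negative")]
X, y = build_training_set(known_positive, newly_labeled)
```

The resulting `X` and `y` could then be passed to whichever classifier is chosen for the risk model.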
According to the sample label determination method of financial data disclosed by the invention, multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set are determined; generating a target hypersphere equation through the multi-dimensional financial data characteristic values of the users in the positive sample set and the hypersphere equation; respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of the users; and comparing the distance of the hyper-sphere of the user with a threshold value to determine a sample label of the user, so that a positive sample in an unclassified sample can be extracted, and the positive sample and a negative sample can be accurately classified, thereby improving the calculation effect and the calculation precision of a machine learning model.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
FIG. 3 is a flow chart illustrating a method for sample tag determination of financial data according to another exemplary embodiment. The flow shown in fig. 3 is a detailed description of S102 "determining multi-dimensional financial data characteristic values of a plurality of users in the positive sample set and the unclassified sample set" in the flow shown in fig. 2.
As shown in fig. 3, in S302, a characteristic dimension of the financial data of the user is determined. Financial characteristics of the user, such as income, age, gender, job category, etc., may be used, for example, as a characteristic dimension.
In S304, the multi-dimensional feature vector reference value is determined based on the values that the financial data of the plurality of users take in each feature dimension.
In one embodiment, a feature such as age may be represented in a one-hot manner: the age range is divided into several segments, such as 0-10, 10-20, ..., >100, and a representative value is determined for each segment; for example, the segment 0-10 may have a representative value of 1, the segment 10-20 a representative value of 2, and so on.
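The age segmentation can be sketched as follows. The text leaves the segment boundaries ambiguous, so this sketch assumes half-open segments (an age of exactly 10 falls in 10-20) and treats 100 and above as the open-ended top segment; the function names are hypothetical.

```python
def age_representative_value(age, width=10, cap=100):
    """Map an age to its segment's representative value: 0-10 -> 1, 10-20 -> 2, ..."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= cap:
        return cap // width + 1  # open-ended top segment
    return int(age // width) + 1

def age_one_hot(age, width=10, cap=100):
    """One-hot vector with a single 1 at the position of the age segment."""
    n_segments = cap // width + 1
    vec = [0] * n_segments
    vec[age_representative_value(age, width, cap) - 1] = 1
    return vec

print(age_representative_value(5))   # 1
print(age_representative_value(15))  # 2
print(age_one_hot(35))               # 1 at index 3, zeros elsewhere
```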
In one embodiment, for revenue-like features, an averaging method may be used, such as determining an average of the user's revenue in the positive sample set from a plurality of user financial data in the positive sample set.
In S306, financial data for a plurality of users in the positive sample set and the unclassified sample set is obtained.
In S308, the financial data of the multiple users are classified according to preset dimensions, and multi-dimensional financial data is generated.
In S310, the multidimensional financial data is compared with a preset multidimensional feature vector reference value to generate the multidimensional financial data feature values of a plurality of users.
The income of the user in the negative sample may be compared with the average value of the income of the user in the positive sample, and the characteristic value of the financial data of the dimension may be determined by a value such as variance or standard deviation.
The age of the user in the negative sample may also be compared to a standard age range, for example, to determine a characteristic value for the age dimension.
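The income comparison can be illustrated with a standardized deviation from the positive-set mean. This is only one reading of "a value such as variance or standard deviation"; the function names and the choice of the population standard deviation are assumptions.

```python
import statistics

def income_reference(positive_incomes):
    """Reference value for the income dimension: mean and spread of the positive set."""
    mean = statistics.mean(positive_incomes)
    std = statistics.pstdev(positive_incomes)  # population standard deviation
    return mean, std

def income_feature_value(income, mean, std):
    """Number of standard deviations by which an income differs from the reference mean."""
    if std == 0:
        return 0.0  # assumption: no spread in the reference set, treat as no deviation
    return (income - mean) / std

# Hypothetical incomes of users in the positive sample set.
mean, std = income_reference([4000, 5000, 6000])
value = income_feature_value(7000, mean, std)
```

A user whose income equals the positive-set mean gets a feature value of 0, and larger absolute values indicate incomes further from the reference.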
FIG. 4 is a flow chart illustrating a method for sample tag determination of financial data in accordance with another exemplary embodiment. The flow shown in fig. 4 is a detailed description of S204 "generating a target hypersphere equation by using the hypersphere equation and the multidimensional financial data feature values of a plurality of users in the positive sample set" in the flow shown in fig. 2.
As shown in fig. 4, in S402, an initial hypersphere equation is constructed based on the hypersphere equation and multi-dimensional financial data characteristic values of a plurality of users in the positive sample set.
In S404, a target slack variable threshold and an optimization target are determined. The optimization target is as follows: distances between the multi-dimensional financial data characteristic values of the plurality of users and the center point of the hyper-sphere equation are smaller than the radius of the hyper-sphere equation.
In S406, the initial hyper-sphere equation is solved by an optimization algorithm based on the target relaxation variable threshold and the optimization target, and an optimal solution of the initial hyper-sphere equation is obtained.
In S408, the target hyper-sphere equation is generated by the parameters of the hyper-sphere equation corresponding to the optimal solution.
As with the hypersphere formula above, an initial hypersphere equation can be constructed based on the hypersphere equation and the multi-dimensional financial data characteristic values of the multiple users in the positive sample set; an optimization target is determined; and the hypersphere equation is solved iteratively, subject to the slack variable threshold, until an optimal solution that meets the optimization target is obtained. The target hypersphere equation is then generated from the parameters of the hypersphere equation corresponding to the optimal solution.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as computer programs executed by a CPU. When executed by the CPU, the program performs the functions defined by the above-described methods provided by the present disclosure. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
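To make the trade-off between radius and slack concrete, the sketch below solves a deliberately simplified version of the problem. It is not the Lagrangian dual solution described in this disclosure: it assumes V(r) = r, fixes the center at the mean of the positive samples, and then picks the radius that exactly minimizes r + C * sum(max(0, d_i - r)) over the distances d_i. The name `fit_hypersphere` is hypothetical.

```python
import math

def fit_hypersphere(positives, C):
    """Return (center, radius) for a mean-centered sphere.

    With the center fixed, the objective r + C * sum(max(0, d_i - r)) is
    piecewise linear in r. Its slope is 1 - C * (number of points outside),
    so the minimum leaves at most floor(1 / C) points outside the sphere.
    """
    m = len(positives)
    dim = len(positives[0])
    center = [sum(x[k] for x in positives) / m for k in range(dim)]
    dists = sorted((math.dist(x, center) for x in positives), reverse=True)
    allowed_outside = min(m, math.floor(1.0 / C))
    radius = dists[allowed_outside] if allowed_outside < m else 0.0
    return center, radius

# Hypothetical positive samples with one far-away point.
positives = [[0, 0], [2, 0], [0, 2], [2, 2], [10, 10]]
center_strict, radius_strict = fit_hypersphere(positives, C=2.0)  # large penalty: enclose all
center_loose, radius_loose = fit_hypersphere(positives, C=0.5)    # small penalty: allow outliers
```

A larger penalty coefficient C forces the sphere to enclose every positive sample, while a smaller C lets the radius shrink and leaves the most distant points outside.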
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 5 is a block diagram illustrating a sample tag determination apparatus for financial data in accordance with an exemplary embodiment. As shown in fig. 5, the sample tag determination apparatus 50 for financial data includes: a feature module 502, an equation module 504, a distance module 506, and a label module 508. It may further include a data module 510 and a training module 512.
The feature module 502 is configured to determine multi-dimensional financial data feature values of a plurality of users in a positive sample set and an unclassified sample set, each of which includes financial data of the plurality of users;
the equation module 504 is configured to generate a target hypersphere equation through the hypersphere equation and the multidimensional financial data feature values of the plurality of users in the positive sample set; the equation module 504 is further configured to perform parametric training on the hypersphere equation by multi-dimensional financial data feature values of a plurality of users in the positive sample set to generate a target hypersphere equation.
Wherein, the equation module 504 includes: the constructing unit is used for constructing an initial hypersphere equation based on the hypersphere equation and multi-dimensional financial data characteristic values of a plurality of users in the positive sample set; the parameter unit is used for determining a target relaxation variable threshold and an optimization target; the solving unit is used for solving the initial hypersphere equation through an optimization algorithm based on the target relaxation variable threshold and the optimization target to obtain the optimal solution of the initial hypersphere equation; and the equation unit is used for generating the target hyper-sphere equation through the parameters of the hyper-sphere equation corresponding to the optimal solution.
The optimization target is as follows: distances between the multi-dimensional financial data characteristic values of the plurality of users and the center point of the hyper-sphere equation are smaller than the radius of the hyper-sphere equation.
The distance module 506 is configured to respectively substitute the multi-dimensional financial data feature values of the users in the unclassified sample set into the target hypersphere equation to obtain hypersphere distances of the users; the distance module 506 includes: the substituting unit is used for respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation; and the distance unit is used for solving the target hypersphere equation by adopting a Lagrange dual method so as to obtain the hypersphere distance of the user.
The label module 508 is used to compare the hypersphere distance of the user to a threshold to determine a sample label for the user, the sample label being either a positive sample label or a negative sample label. The label module 508 includes: a negative sample unit, used for determining that the sample label of the user is a negative sample label when the hypersphere distance of the user is greater than the threshold; and a positive sample unit, used for determining that the sample label of the user is a positive sample label when the hypersphere distance of the user is less than or equal to the threshold.
The data module 510 is used for generating positive sample financial data from the positive sample set and the user financial data with the positive sample label; generating negative sample financial data through the user financial data with the negative sample label; and
the training module 512 is configured to train a machine learning model with the positive sample financial data and the negative sample financial data to generate a user violation risk model.
Fig. 6 is a block diagram illustrating a sample tag determination apparatus for financial data according to another exemplary embodiment. As shown in fig. 6, the feature module 502 includes: a data unit 5022, a comparison unit 5024, a dimension unit 5026, and a reference unit 5028.
A data unit 5022 is used to obtain financial data for a plurality of users in the positive sample set and the unclassified sample set; and
the comparing unit 5024 is configured to compare the financial data of the multiple users with a preset multi-dimensional feature vector reference value to generate multi-dimensional financial data feature values of the multiple users.
Wherein, the comparing unit 5024 includes: the classification subunit is used for classifying the financial data of the users according to preset dimensionality to generate multi-dimensional financial data; and the comparison subunit is used for comparing the multi-dimensional financial data with a preset multi-dimensional characteristic vector reference value to generate the multi-dimensional financial data characteristic values of a plurality of users.
A dimensions unit 5026 is used to determine characteristic dimensions of the financial data of the user; and
the reference unit 5028 is configured to determine the multi-dimensional feature vector reference value based on the values of the financial data of the plurality of users whose feature dimensions correspond to the dimensions.
According to the sample label determination device of the financial data, multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set are determined; generating a target hypersphere equation through the multi-dimensional financial data characteristic values of the users in the positive sample set and the hypersphere equation; respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of the users; and comparing the distance of the hyper-sphere of the user with a threshold value to determine a sample label of the user, so that a positive sample in an unclassified sample can be extracted, and the positive sample and a negative sample can be accurately classified, thereby improving the calculation effect and the calculation precision of a machine learning model.
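The way the modules hand data to one another can be sketched as a small pipeline. Everything here is illustrative: the class name `LabelApparatus`, the record fields `income` and `age`, the scaling in `simple_features`, and the placeholder `mean_enclosing_sphere` (which stands in for the equation module and is not the Lagrangian dual solution) are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]

@dataclass
class LabelApparatus:
    feature_module: Callable[[dict], Vector]                          # cf. feature module 502
    equation_module: Callable[[List[Vector]], Tuple[Vector, float]]   # cf. equation module 504

    def run(self, positive_records, unclassified_records):
        positives = [self.feature_module(r) for r in positive_records]
        center, threshold = self.equation_module(positives)
        labels = []
        for record in unclassified_records:
            d = math.dist(self.feature_module(record), center)  # cf. distance module 506
            labels.append("positive" if d <= threshold else "negative")  # cf. label module 508
        return labels

def simple_features(record):
    """Hypothetical scaling of two raw fields into a feature vector."""
    return (record["income"] / 1000, record["age"] / 10)

def mean_enclosing_sphere(points):
    """Placeholder: center at the mean, radius reaching the farthest positive."""
    dim = len(points[0])
    center = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    return center, max(math.dist(p, center) for p in points)

apparatus = LabelApparatus(simple_features, mean_enclosing_sphere)
result = apparatus.run(
    [{"income": 4000, "age": 30}, {"income": 6000, "age": 40}],
    [{"income": 5000, "age": 35}, {"income": 20000, "age": 20}],
)
print(result)  # ['positive', 'negative']
```

Passing the feature and equation steps in as callables mirrors the modular split described above, so either step can be replaced without touching the distance and label logic.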
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 700 according to this embodiment of the disclosure is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 is embodied in the form of a general purpose computing device. The components of the electronic device 700 may include, but are not limited to: at least one processing unit 710, at least one memory unit 720, a bus 730 that connects the various system components (including the memory unit 720 and the processing unit 710), a display unit 740, and the like.
The memory unit 720 stores program code executable by the processing unit 710, so that the processing unit 710 performs the steps according to various exemplary embodiments of the present disclosure described in the method section of this specification. For example, the processing unit 710 may perform the steps shown in figs. 1, 3, and 4.
The memory unit 720 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 7201 and/or a cache memory unit 7202, and may further include a read only memory unit (ROM) 7203.
The memory unit 720 may also include a program/utility 7204 having a set (at least one) of program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 730 may be any representation of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 700' (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 700 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 750. Also, the electronic device 700 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 760. The network adapter 760 may communicate with other modules of the electronic device 700 via the bus 730. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 8, the technical solution according to the embodiment of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present disclosure.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the functions of: determining multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set; generating a target hypersphere equation through the multi-dimensional financial data characteristic values of the users in the positive sample set and the hypersphere equation; respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of the users; and comparing the user's hypersphere distance to a threshold to determine the user's sample label, the sample label being either a positive sample label or a negative sample label.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly and located in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise arrangements or instrumentalities described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for sample tag determination of financial data, comprising:
determining multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set, the positive sample set and the unclassified sample set each comprising financial data of the plurality of users;
generating a target hypersphere equation through the multi-dimensional financial data characteristic values of the users in the positive sample set and the hypersphere equation;
respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of the users; and
comparing the user's hypersphere distance to a threshold to determine the user's sample label, the sample label comprising a positive sample label and a negative sample label.
2. The method of claim 1, wherein determining multi-dimensional financial data characteristic values for a plurality of users in a positive sample set and an unclassified sample set comprises:
obtaining financial data for a plurality of users in the positive sample set and the unclassified sample set; and
comparing the financial data of the users with a preset multi-dimensional characteristic vector reference value to generate the multi-dimensional financial data characteristic values of the users.
3. The method of any one of claims 1-2, wherein comparing the financial data of the plurality of users to a pre-defined multi-dimensional feature vector reference value to generate the multi-dimensional financial data feature values of the plurality of users comprises:
classifying the financial data of the users according to preset dimensionality to generate multi-dimensional financial data; and
comparing the multi-dimensional financial data with a preset multi-dimensional characteristic vector reference value to generate the multi-dimensional financial data characteristic values of a plurality of users.
4. The method of any one of claims 1-3, further comprising:
determining a characteristic dimension of the financial data of the user; and
determining the multi-dimensional feature vector reference value based on the numerical values of the financial data of a plurality of users of which the feature dimensions correspond to the dimensions.
5. The method of any one of claims 1-4, wherein generating a target hypersphere equation from the hypersphere equation and multidimensional financial data feature values of the plurality of users in the positive sample set comprises:
performing parameter training on the hypersphere equation through multi-dimensional financial data characteristic values of a plurality of users in the positive sample set to generate a target hypersphere equation.
6. The method of any one of claims 1-5, wherein parametrically training the hypersphere equation to generate a target hypersphere equation by multidimensional financial data feature values for a plurality of users in the positive sample set comprises:
constructing an initial hypersphere equation based on the hypersphere equation and multi-dimensional financial data characteristic values of a plurality of users in the positive sample set;
determining a target relaxation variable threshold and an optimization target;
solving the initial hyper-sphere equation through an optimization algorithm based on the target relaxation variable threshold and the optimization target to obtain an optimal solution of the initial hyper-sphere equation; and
generating the target hypersphere equation according to the parameter of the hypersphere equation corresponding to the optimal solution.
7. The method of any of claims 1-6, wherein the optimization objective is: distances between the multi-dimensional financial data characteristic values of the plurality of users and the center point of the hyper-sphere equation are smaller than the radius of the hyper-sphere equation.
8. An apparatus for sample tag determination of financial data, comprising:
a feature module, used for determining multi-dimensional financial data characteristic values of a plurality of users in a positive sample set and an unclassified sample set, wherein the positive sample set and the unclassified sample set both comprise financial data of the plurality of users;
the equation module is used for generating a target hypersphere equation through the multi-dimensional financial data characteristic values of the users in the positive sample set and the hypersphere equation;
the distance module is used for respectively substituting the multi-dimensional financial data characteristic values of the users in the unclassified sample set into the target hypersphere equation to obtain the hypersphere distance of the users; and
a label module to compare the hyper-sphere distance of the user to a threshold to determine a sample label for the user, the sample label comprising a positive sample label and a negative sample label.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201910921682.2A 2019-09-27 2019-09-27 Sample label determination method and device for financial data and electronic equipment Pending CN110796172A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910921682.2A CN110796172A (en) 2019-09-27 2019-09-27 Sample label determination method and device for financial data and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910921682.2A CN110796172A (en) 2019-09-27 2019-09-27 Sample label determination method and device for financial data and electronic equipment

Publications (1)

Publication Number Publication Date
CN110796172A true CN110796172A (en) 2020-02-14

Family

ID=69439878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910921682.2A Pending CN110796172A (en) 2019-09-27 2019-09-27 Sample label determination method and device for financial data and electronic equipment

Country Status (1)

Country Link
CN (1) CN110796172A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139249A1 (en) * 2020-05-28 2021-07-15 平安科技(深圳)有限公司 Data anomaly detection method, apparatus and device, and storage medium
CN113204603A (en) * 2021-05-21 2021-08-03 中国光大银行股份有限公司 Method and device for marking categories of financial data assets

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101907681A (en) * 2010-07-15 2010-12-08 南京航空航天大学 Analog circuit dynamic online failure diagnosing method based on GSD-SVDD
CN107563431A (en) * 2017-08-28 2018-01-09 西南交通大学 A kind of image abnormity detection method of combination CNN transfer learnings and SVDD

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
裔阳等: "基于正样本和未标记样本的遥感图像分类方法" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139249A1 (en) * 2020-05-28 2021-07-15 平安科技(深圳)有限公司 Data anomaly detection method, apparatus and device, and storage medium
CN113204603A (en) * 2021-05-21 2021-08-03 中国光大银行股份有限公司 Method and device for marking categories of financial data assets
CN113204603B (en) * 2021-05-21 2024-02-02 中国光大银行股份有限公司 Category labeling method and device for financial data assets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination