CN110796171A - Unclassified sample processing method and device of machine learning model and electronic equipment - Google Patents


Info

Publication number
CN110796171A
CN110796171A (application CN201910921670.XA)
Authority
CN
China
Prior art keywords
unclassified
financial data
machine learning
subset
positive sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910921670.XA
Other languages
Chinese (zh)
Inventor
王鹏 (Wang Peng)
高明宇 (Gao Mingyu)
张潮华 (Zhang Chaohua)
郑彦 (Zheng Yan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qilu Information Technology Co Ltd
Original Assignee
Beijing Qilu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qilu Information Technology Co Ltd filed Critical Beijing Qilu Information Technology Co Ltd
Priority to CN201910921670.XA priority Critical patent/CN110796171A/en
Publication of CN110796171A publication Critical patent/CN110796171A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The disclosure relates to an unclassified sample processing method and apparatus for a machine learning model, an electronic device, and a computer-readable medium. The method comprises the following steps: dividing an unclassified sample set of a machine learning model into a plurality of unclassified subsets by a self-encoding algorithm, the unclassified sample set comprising financial data of a plurality of users; comparing the similarity of a positive sample set, which likewise comprises financial data of a plurality of users, with each of the plurality of unclassified subsets; and determining each unclassified subset to be a positive sample subset or a negative sample subset according to the similarity comparison result. The method, apparatus, electronic device, and computer-readable medium can extract the positive samples hidden among the unclassified samples and accurately separate positive from negative samples, thereby improving the computational performance and accuracy of the machine learning model.

Description

Unclassified sample processing method and device of machine learning model and electronic equipment
Technical Field
The present disclosure relates to the field of computer information processing, and in particular, to a method and an apparatus for processing an unclassified sample of a machine learning model, an electronic device, and a computer-readable medium.
Background
In a typical application of a machine learning model, a user first selects a model of a certain category or algorithm, then inputs data specific to the problem to be solved so that the model is set up for a specific task; the model is then trained on that data, and after training is finished a model suited to that particular task is obtained. In general, even when the same algorithm is used, machine learning models trained on different data are completely different.
In general, a machine learning model needs to learn from both positive and negative samples: positive samples are those belonging to the correctly labelled class, and in principle any other sample outside that class may be chosen as a negative sample. In the financial field and similar fields, however, only the positive samples are easy to select. For example, when searching for potential defaulting customers, users who have already defaulted can be used as positive samples for training the machine learning model, but among the remaining large number of users (those without default records) it is unclear which ones might default. If all users without a default record at the current moment (the unlabelled samples) are used directly as negative samples to train the model, a large number of true positive samples is likely to be hidden among them; this introduces substantial erroneous data into the training, and the resulting model performs poorly.
Therefore, there is a need for a new unclassified sample processing method, apparatus, electronic device, and computer-readable medium for a machine learning model.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present disclosure provides an unclassified sample processing method and apparatus for a machine learning model, an electronic device, and a computer-readable medium, which can extract positive samples from the unclassified samples and accurately separate positive from negative samples, thereby improving the computational performance and accuracy of the machine learning model.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, an unclassified sample processing method of a machine learning model is provided, the method including: dividing an unclassified sample set of a machine learning model into a plurality of unclassified subsets through a self-coding algorithm, wherein the unclassified sample set comprises a plurality of user financial data; respectively performing similarity comparison on a positive sample set and each unclassified subset in the plurality of unclassified subsets, wherein the positive sample set comprises a plurality of user financial data; and determining the unclassified subset as a positive sample subset or a negative sample subset according to the similarity comparison result.
Optionally, the method further comprises: after the similarity comparison is finished, generating positive sample financial data for the machine learning model from the positive sample set and the at least one positive sample subset; and generating negative sample financial data for the machine learning model from the at least one negative sample subset.
Optionally, the dividing the set of unclassified samples of the machine learning model into a plurality of unclassified subsets by a self-coding algorithm comprises: inputting a plurality of user financial data in the set of unclassified samples of a machine learning model into a classification model, respectively; the classification model performs encoding and decoding operations on each financial data in the plurality of user financial data to determine a classification label of the financial data; and generating the plurality of unclassified subsets from the user financial data having the same label; wherein the classification model is generated by a self-encoding algorithm.
Optionally, the step of performing an encoding and decoding operation on each of the plurality of user financial data by the classification model to determine the classification label thereof comprises: the classification model carries out coding operation on each piece of user financial data in the plurality of pieces of user financial data to generate a feature code; sequentially inputting the feature codes into a multilayer neural network structure of the classification model to carry out reconstruction processing to generate a low-dimensional feature value; and comparing the low-dimensional feature values to the step feature values in the classification model to determine classification labels for the user financial data.
Optionally, the dividing the set of unclassified samples of the machine learning model into a plurality of unclassified subsets by a self-coding algorithm further comprises: extracting part of user financial data from the positive sample set to generate training data; and training a self-coding algorithm model through the training data to generate the classification model.
Optionally, the extracting the part of the user financial data from the positive sample set to generate the training data comprises: extracting part of the user financial data from the positive sample set; and screening the multi-dimensional user characteristics of the user financial data to generate training data.
Optionally, the comparing the similarity of the positive sample set with each of the plurality of unclassified subsets comprises: and respectively carrying out similarity comparison on the positive sample set and each unclassified subset in the plurality of unclassified subsets through a minhash algorithm.
Optionally, the performing, by a MinHash algorithm, similarity comparison between the positive sample set and each of the plurality of unclassified subsets comprises: determining a plurality of first hash values of the user financial data in the positive sample set in a preset manner; determining a plurality of second hash values of the user financial data in the unclassified subset in a preset manner; and comparing the plurality of first hash values with the plurality of second hash values to perform the similarity comparison.
Optionally, determining the unclassified subset to be a positive sample subset or a negative sample subset according to the similarity comparison result comprises: generating a similarity value from the similarity comparison result of a preset number of top-ranked users; determining the unclassified subset to be a positive sample subset when the similarity value is greater than or equal to a threshold; and determining the unclassified subset to be a negative sample subset when the similarity value is less than the threshold.
Optionally, the method further comprises: training a machine learning model with the positive sample financial data and the negative sample financial data to generate a user breach risk model.
According to an aspect of the present disclosure, an unclassified sample processing apparatus of a machine learning model is provided, the apparatus including: the classification module is used for dividing an unclassified sample set of the machine learning model into a plurality of unclassified subsets through a self-coding algorithm, wherein the unclassified sample set comprises a plurality of user financial data; a comparison module, configured to perform similarity comparison between a positive sample set and each of the plurality of unclassified subsets, where the positive sample set includes a plurality of user financial data; and the set module is used for determining the unclassified subset as a positive sample subset or a negative sample subset according to the similarity comparison result.
Optionally, the apparatus further comprises: a positive sample module for generating positive sample financial data for the machine learning model from the positive sample set and the at least one positive sample subset after the similarity comparison is finished; and a negative sample module for generating negative sample financial data for the machine learning model from the at least one negative sample subset.
Optionally, the classification module comprises: an input unit for inputting a plurality of user financial data in the set of unclassified samples of a machine learning model into a classification model, respectively; the tag unit is used for carrying out encoding and decoding operation on each financial data in the plurality of user financial data by the classification model so as to determine a classification tag of the financial data; and a subset unit for generating the plurality of unclassified subsets from the user financial data having the same label; wherein the classification model is generated by a self-encoding algorithm.
Optionally, the tag unit comprises: the characteristic subunit is used for carrying out coding operation on each piece of user financial data in the plurality of pieces of user financial data by the classification model to generate a characteristic code; the reconstruction subunit is used for sequentially inputting the feature codes into the multilayer neural network structure of the classification model for reconstruction processing to generate a low-dimensional feature value; and a comparison subunit, configured to compare the low-dimensional feature value with the step feature value in the classification model to determine a classification label of the user financial data.
Optionally, the classification module further comprises: the data unit is used for extracting part of user financial data from the positive sample set to generate training data; and the training unit is used for training a self-coding algorithm model through the training data to generate the classification model.
Optionally, the data unit is further configured to extract a part of the user financial data from the positive sample set; and screening the multi-dimensional user characteristics of the user financial data to generate training data.
Optionally, the comparing module is further configured to perform similarity comparison on the positive sample set and each of the plurality of unclassified subsets by using a minhash algorithm.
Optionally, the comparison module comprises: a sorting unit for determining a plurality of first hash values of the user financial data in the positive sample set in a preset manner, and determining a plurality of second hash values of the user financial data in the unclassified subset in a preset manner; and a similarity unit for comparing the plurality of first hash values with the plurality of second hash values to perform the similarity comparison.
Optionally, the aggregation module includes: the numerical value unit is used for generating a similarity numerical value according to the similarity comparison result of the users with preset ranks; a positive sample unit, configured to determine the unclassified subset as a positive sample subset when the similarity value is greater than or equal to a threshold; and a negative sample unit, configured to determine the unclassified subset as a negative sample subset when the similarity value is smaller than a threshold.
Optionally, the apparatus further comprises: a model module for training a machine learning model with the positive sample financial data and the negative sample financial data to generate a user default risk model.
According to an aspect of the present disclosure, an electronic device is provided, comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to an aspect of the disclosure, a computer-readable medium is provided, on which a computer program is stored; the program, when executed by a processor, implements the method described above.
According to the unclassified sample processing method and apparatus, electronic device, and computer-readable medium of the present disclosure, an unclassified sample set of the machine learning model is divided into a plurality of unclassified subsets by a self-encoding algorithm; the positive sample set, which comprises financial data of a plurality of users, is compared for similarity with each unclassified subset; and each unclassified subset is determined to be a positive sample subset or a negative sample subset according to the similarity comparison result. In this way, positive samples can be extracted from the unclassified samples and positive and negative samples can be accurately separated, improving the computational performance and accuracy of the machine learning model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
FIG. 1 is a flow diagram illustrating a method of unclassified sample processing for a machine learning model according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of unclassified sample processing for a machine learning model according to another exemplary embodiment.
FIG. 3 is a flow diagram illustrating a method of unclassified sample processing for a machine learning model according to another exemplary embodiment.
FIG. 4 is a block diagram illustrating an unclassified sample processing apparatus of a machine learning model according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 6 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
FIG. 1 is a flow diagram illustrating a method of unclassified sample processing for a machine learning model according to an exemplary embodiment. The unclassified sample processing method 10 of the machine learning model includes at least steps S102 to S106.
As shown in fig. 1, in S102, an unclassified sample set of the machine learning model is divided into a plurality of unclassified subsets by a self-coding algorithm, where the unclassified sample set includes a plurality of user financial data.
The sample data may be financial data of users of a financial services company. Further, the positive sample set may contain financial data of users with default records, and the unclassified sample set may contain financial data of users without default records.
In one embodiment, the positive sample set may contain financial data of users who have made loan records, and the unclassified sample set may contain financial data of users who have not made loan records.
In other embodiments, the positive sample set may further include other financial data of the user with a certain determined financial characteristic, and the negative sample set may include financial data of a user without a certain financial characteristic occurring at present, which is not limited in this application.
The method specifically comprises the following steps: inputting a plurality of user financial data in the set of unclassified samples of a machine learning model into a classification model, respectively; the classification model performs encoding and decoding operations on each financial data in the plurality of user financial data to determine a classification label of the financial data; and generating the plurality of unclassified subsets from the user financial data having the same label; wherein the classification model is generated by a self-encoding algorithm.
The self-encoding algorithm is an unsupervised algorithm: it automatically learns features from unlabelled data and thereby produces a better feature description of the original data, and the unlabelled data are then classified based on the feature description given by the self-encoding algorithm.
In one embodiment, training data is generated by extracting part of the user financial data from the positive sample set; and training a self-coding algorithm model through the training data to generate the classification model.
Wherein, the step of extracting part of the user financial data from the positive sample set to generate training data comprises the following steps: extracting part of the user financial data from the positive sample set; and screening the multi-dimensional user characteristics of the user financial data to generate training data.
For example, the positive sample set contains the financial data of users with default records, and part of this data is extracted from the positive sample set at random. The user financial data may include user features of many dimensions; to keep subsequent computation fast and efficient, these multi-dimensional features can be screened. Specifically, for instance, features known from past experience to be irrelevant to default can be removed from the user financial data, and the remaining data form the training data.
Features such as the user's registration time and login time on a financial website may be deleted from the user's financial data, and the training data may be generated from features such as the user's work, age, income, and occupation.
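The screening step above can be sketched in a few lines. This is a minimal illustration only: the column names (`register_time`, `login_time`, `age`, etc.) and the dict-of-records representation are assumptions for the example, not part of the patent.

```python
# Hypothetical sketch of the feature-screening step: drop columns judged
# irrelevant to default (e.g. registration and login times) and keep
# features such as age, income, and occupation. Column names are
# illustrative only.

def screen_features(records, drop_cols=("register_time", "login_time")):
    """Remove irrelevant columns from a list of per-user dicts."""
    return [{k: v for k, v in r.items() if k not in drop_cols}
            for r in records]

raw = [{"age": 35, "income": 52000, "occupation": "clerk",
        "register_time": "2019-01-02", "login_time": "2019-09-01"}]
training_data = screen_features(raw)
```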
In S104, the positive sample set, which includes the financial data of a plurality of users, is compared for similarity with each of the plurality of unclassified subsets. This may include: comparing the positive sample set with each of the plurality of unclassified subsets by the MinHash algorithm.
In one embodiment, performing the similarity comparison between the positive sample set and each of the plurality of unclassified subsets by the MinHash algorithm comprises: determining a plurality of first hash values of the user financial data in the positive sample set in a preset manner; determining a plurality of second hash values of the user financial data in the unclassified subset in a preset manner; and comparing the plurality of first hash values with the plurality of second hash values to perform the similarity comparison.
When the sets contain few elements, the similarity between users in set A and users in set B can be found by one-by-one comparison. However, when the sets contain many elements, for example millions or tens of millions each, one-by-one comparison with complexity O(n) takes a great deal of time; using the MinHash algorithm reduces the cost of comparing the similarity of two sets to a constant. The specific calculation process is described in detail in the embodiment corresponding to Fig. 3.
In S106, the unclassified subset is determined to be a positive sample subset or a negative sample subset according to the similarity comparison result. This may include: generating a similarity value from the similarity comparison result of a preset number of top-ranked users; determining the unclassified subset to be a positive sample subset when the similarity value is greater than or equal to a threshold; and determining it to be a negative sample subset when the similarity value is less than the threshold.
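The decision rule of S106 reduces to a simple threshold test. The sketch below is illustrative only; the threshold value 0.5 is an assumption for the example, since the patent does not fix a specific value.

```python
# Minimal sketch of the decision rule in S106: a subset whose similarity
# value to the positive sample set is at or above the threshold becomes a
# positive sample subset, otherwise a negative sample subset. The
# threshold value is illustrative.

def label_subset(similarity, threshold=0.5):
    """Classify one unclassified subset by its similarity value."""
    return "positive" if similarity >= threshold else "negative"

labels = [label_subset(0.72), label_subset(0.31)]
```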
In one embodiment, the method further comprises: after the similarity comparison is finished, generating positive sample financial data for the machine learning model from the positive sample set and the at least one positive sample subset; and generating negative sample financial data for the machine learning model from the at least one negative sample subset.
In one embodiment, a machine learning model is trained on the positive sample financial data and the negative sample financial data to generate a user breach risk model. The positive sample financial data and the negative sample financial data can be regarded as accurate sample data, and the machine learning model is trained through the accurate sample data, so that a more accurate model training result can be obtained.
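Assembling the cleaned training set is then straightforward. A minimal sketch, with illustrative names; labels 1 (positive, e.g. default) and 0 (negative) are a conventional encoding assumed for the example, not mandated by the patent.

```python
# Sketch of building the final training set once the unclassified subsets
# have been resolved: positive sample financial data get label 1,
# negative sample financial data get label 0.

def build_training_set(positive_data, negative_data):
    """Combine positive and negative financial data into (X, y)."""
    X = list(positive_data) + list(negative_data)
    y = [1] * len(positive_data) + [0] * len(negative_data)
    return X, y

X, y = build_training_set(["pos_user_a", "pos_user_b"], ["neg_user_c"])
```

The resulting (X, y) pair can then be fed to any supervised learner to produce the user default risk model.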
According to the unclassified sample processing method of the machine learning model, an unclassified sample set of the machine learning model is divided into a plurality of unclassified subsets through a self-coding algorithm; respectively performing similarity comparison on a positive sample set and each unclassified subset in the plurality of unclassified subsets, wherein the positive sample set comprises a plurality of user financial data; and determining the unclassified subset as a positive sample subset or a negative sample subset according to the similarity comparison result, so that a positive sample in the unclassified sample can be extracted, and the positive sample and the negative sample can be accurately classified, thereby improving the calculation effect and the calculation precision of the machine learning model.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
FIG. 2 is a flow diagram illustrating a method of unclassified sample processing for a machine learning model according to another exemplary embodiment. The flow shown in fig. 2 is a detailed description of S102 "dividing the set of unclassified samples of the machine learning model into a plurality of unclassified subsets by the self-coding algorithm" in the flow shown in fig. 1.
As shown in fig. 2, in S202, the plurality of user financial data in the set of unclassified samples of the machine learning model are respectively input into the classification model. Wherein the classification model is generated by a self-encoding algorithm.
In one embodiment, further comprising: extracting part of user financial data from the positive sample set to generate training data; and training a self-coding algorithm model through the training data to generate the classification model.
The internal structure in the self-coding algorithm model may consist of two parts:
1) An encoder: this part compresses the input into a latent-space representation, which can be expressed by the encoding function h = f(x).
2) A decoder: this part reconstructs the input from the latent-space representation, which can be expressed by the decoding function r = g(h).
The basic structural unit of the self-encoding algorithm model is the autoencoder: the input features X are encoded according to a given rule and training algorithm, and the original features are re-expressed as a low-dimensional vector. Training methods for the model may include gradient descent, least squares, and similar iterative algorithms.
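The encoder/decoder pair h = f(x), r = g(h) trained by gradient descent can be sketched with a toy linear autoencoder. This is not the patent's model; the dimensions, learning rate, and step count are assumptions chosen only to make the mechanics concrete.

```python
import numpy as np

# Toy linear autoencoder: h = f(x) is the encoder, r = g(h) the decoder,
# both trained jointly by gradient descent on mean squared
# reconstruction error. All sizes are illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                 # 100 samples, 8-dim features
W_enc = rng.normal(scale=0.1, size=(8, 3))    # encode 8 dims -> 3 dims
W_dec = rng.normal(scale=0.1, size=(3, 8))    # decode 3 dims -> 8 dims

loss0 = float(np.mean((X @ W_enc @ W_dec - X) ** 2))  # initial error

lr = 0.05
for _ in range(300):
    H = X @ W_enc                             # h = f(x): low-dim code
    R = H @ W_dec                             # r = g(h): reconstruction
    err = R - X
    g_dec = H.T @ err / len(X)                # gradient wrt decoder
    g_enc = X.T @ (err @ W_dec.T) / len(X)    # gradient wrt encoder
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

loss = float(np.mean((X @ W_enc @ W_dec - X) ** 2))   # reduced error
```

After training, the 3-dimensional code H is the low-dimensional re-expression of the original 8-dimensional features.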
In S204, the classification model performs an encoding and decoding operation on each of the plurality of user financial data to determine a classification tag thereof.
Wherein the classification model performing encoding and decoding operations on each of the plurality of user financial data to determine the classification label thereof comprises: the classification model carries out coding operation on each piece of user financial data in the plurality of pieces of user financial data to generate a feature code; sequentially inputting the feature codes into a multilayer neural network structure of the classification model to carry out reconstruction processing to generate a low-dimensional feature value; and comparing the low-dimensional feature values to the step feature values in the classification model to determine classification labels for the user financial data.
In S206, the plurality of unclassified subsets are generated from the user financial data having the same tag. The classification model processes the user financial data, assigns a classification label to each record, and places the records with the same label in one set, generating an unclassified subset. There may be multiple unclassified subsets; alternatively, the number of classes may be set as an initial parameter before the classification model runs, which is not limited in this disclosure.
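The grouping in S206 can be sketched as follows. The record values and labels are placeholders for illustration; the patent does not prescribe this data representation.

```python
from collections import defaultdict

# Sketch of S206: group user financial records by the classification
# label assigned by the model, producing one unclassified subset per
# label. Record identifiers and labels here are placeholders.

def group_by_label(records, labels):
    """Return {label: [records with that label]}."""
    subsets = defaultdict(list)
    for record, label in zip(records, labels):
        subsets[label].append(record)
    return dict(subsets)

subsets = group_by_label(["u1", "u2", "u3", "u4"], [0, 1, 0, 1])
```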
FIG. 3 is a flow diagram illustrating an unclassified sample processing method of a machine learning model according to another exemplary embodiment. The flow shown in FIG. 3 is a detailed description of S204, "comparing the similarity of the positive sample set with each of the plurality of unclassified subsets," in the flow shown in FIG. 1.
as shown in fig. 3, in S302, a plurality of first hash values of the user financial data in the positive sample set are determined in a preset manner.
In S304, a plurality of second hash values of the user financial data in the unclassified subset are determined in a preset manner.
In S306, the plurality of first hash values and the plurality of second hash values are compared to perform the similarity comparison.
A series of predefined hash functions h(x) may be invoked to calculate the first hash values of the user financial data in the positive sample set, yielding an array [h1min(A), h2min(A), h3min(A), …, hnmin(A)], which is stored as intermediate data.
The same predefined hash functions h(x) may also be invoked to calculate the second hash values of the user financial data in each unclassified subset, yielding an array [h1min(B), h2min(B), h3min(B), …, hnmin(B)], which is likewise stored as intermediate data.
After the calculation is completed, the similarity of the positive sample set (A) and an unclassified subset (B) is computed by reading the array corresponding to each set and defining a random variable r for each hash function as follows:

r = 1 if hmin(A) = hmin(B), else r = 0

The similarity of sets A and B is then:

∑r / n
Here n may be set to 2048. Thus, no matter how many user financial data items the positive sample set and an unclassified subset contain, each is reduced by minhash to an array of 2048 elements before the comparison, which speeds up the similarity comparison between the positive sample set and the unclassified subsets.
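As a hedged illustration of the minhash comparison above, the two sets can be reduced to n-element arrays of per-hash minima and compared position by position. The hash family, prime modulus, and set contents below are assumptions for illustration; only n = 2048 comes from the text:

```python
import random

# Minhash sketch: n hash functions of the form h_i(x) = (a_i*hash(x) + b_i) mod p.
# Each set is reduced to the array of per-function minima, and similarity is
# the fraction of positions where the two minima agree (the sum of r over n).
def minhash_signature(items, n=2048, seed=0):
    rnd = random.Random(seed)                      # same seed => same hash family
    p = (1 << 61) - 1                              # large Mersenne prime (assumed)
    params = [(rnd.randrange(1, p), rnd.randrange(p)) for _ in range(n)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def minhash_similarity(sig_a, sig_b):
    # r = 1 if hmin(A) == hmin(B) else 0, averaged over the n hash functions
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(0, 80))
B = set(range(20, 100))        # true Jaccard similarity: 60 / 100 = 0.6
sim = minhash_similarity(minhash_signature(A), minhash_signature(B))
print(round(sim, 2))           # close to 0.6
```

Note that both sets must be hashed with the same predefined hash family (here enforced by the shared seed), exactly as the description requires.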
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as computer programs executed by a CPU. When executed by the CPU, such a program performs the functions defined by the above-described methods provided by the present disclosure. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 4 is a block diagram illustrating an unclassified sample processing apparatus of a machine learning model according to an exemplary embodiment. As shown in FIG. 4, the unclassified sample processing apparatus 40 of the machine learning model includes: a classification module 402, a comparison module 404, an aggregation module 406, a positive sample module 408, and a negative sample module 410.
The classification module 402 is configured to classify an unclassified sample set of the machine learning model into a plurality of unclassified subsets by using a self-coding algorithm, where the unclassified sample set includes a plurality of user financial data.
The classification module 402 includes: an input unit for respectively inputting the plurality of user financial data in the unclassified sample set of the machine learning model into a classification model; and a tag unit for the classification model to perform encoding and decoding operations on each item of the plurality of user financial data to determine its classification label.
The tag unit includes: a feature subunit for the classification model to perform an encoding operation on each item of the plurality of user financial data to generate a feature code; a reconstruction subunit for sequentially inputting the feature codes into the multilayer neural network structure of the classification model for reconstruction processing to generate a low-dimensional feature value; and a comparison subunit for comparing the low-dimensional feature value with the step feature values in the classification model to determine the classification label of the user financial data.
The classification module 402 further includes: a data unit for extracting part of the user financial data from the positive sample set and screening the multi-dimensional user characteristics of that data to generate training data; a training unit for training a self-encoding algorithm model with the training data to generate the classification model; and a subset unit for generating the plurality of unclassified subsets from the user financial data having the same label, wherein the classification model is generated by a self-encoding algorithm.
The comparison module 404 is configured to perform a similarity comparison between a positive sample set and each of the plurality of unclassified subsets, where the positive sample set includes a plurality of user financial data; the comparison module 404 is further configured to perform the similarity comparison between the positive sample set and each of the plurality of unclassified subsets through a minhash algorithm.
The comparison module 404 includes: a sorting unit for determining a plurality of first hash values of the user financial data in the positive sample set in a preset manner, and determining a plurality of second hash values of the user financial data in each unclassified subset in the same preset manner; and a similarity unit for comparing the plurality of first hash values with the plurality of second hash values to perform the similarity comparison.
The aggregation module 406 is configured to determine an unclassified subset as a positive sample subset or a negative sample subset according to the similarity comparison result. The aggregation module 406 includes: a value unit for generating a similarity value according to the similarity comparison results of a preset ranking of users; a positive sample unit for determining the unclassified subset as a positive sample subset when the similarity value is greater than or equal to a threshold; and a negative sample unit for determining the unclassified subset as a negative sample subset when the similarity value is smaller than the threshold.
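The decision rule implemented by the positive and negative sample units can be sketched as follows; the threshold of 0.5 and the similarity values are illustrative assumptions, not values from this disclosure:

```python
# Decide whether an unclassified subset is a positive or negative sample
# subset by comparing its similarity value to a threshold. The threshold
# and the similarity values below are hypothetical.
def label_subset(similarity, threshold=0.5):
    return "positive" if similarity >= threshold else "negative"

similarities = {"subset_1": 0.72, "subset_2": 0.31}  # hypothetical values
results = {sid: label_subset(s) for sid, s in similarities.items()}
print(results)  # {'subset_1': 'positive', 'subset_2': 'negative'}
```

Subsets labelled positive are then merged with the positive sample set to enlarge the training data, while negative subsets supply negative sample financial data.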
The positive sample module 408 is configured to generate positive sample financial data for the machine learning model from the positive sample set and at least one positive sample subset after the similarity comparison is finished; and
the negative examples module 410 is configured to generate negative example financial data for the machine learning model from at least one negative example subset.
The unclassified sample processing device 40 of the machine learning model further includes: a model module to train a machine learning model with the positive sample financial data and the negative sample financial data to generate a user breach risk model.
According to the unclassified sample processing apparatus of the machine learning model of the present disclosure, an unclassified sample set of the machine learning model is divided into a plurality of unclassified subsets through a self-encoding algorithm; a similarity comparison is performed between a positive sample set and each of the plurality of unclassified subsets, the positive sample set comprising a plurality of user financial data; and each unclassified subset is determined to be a positive sample subset or a negative sample subset according to the similarity comparison result. Positive samples can thus be extracted from the unclassified samples, and positive and negative samples can be accurately separated, thereby improving the computational effectiveness and accuracy of the machine learning model.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 500 according to this embodiment of the disclosure is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one memory unit 520, a bus 530 that couples various system components including the memory unit 520 and the processing unit 510, a display unit 540, and the like.
The storage unit stores program code executable by the processing unit 510 to cause the processing unit 510 to perform the steps according to various exemplary embodiments of the present disclosure described in the method sections of this specification. For example, the processing unit 510 may perform the steps shown in FIGS. 1, 2 and 3.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM)5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.
The memory unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 500' (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. The network adapter 560 may communicate with other modules of the electronic device 500 via the bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 6, the technical solution according to the embodiment of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present disclosure.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: dividing an unclassified sample set of a machine learning model into a plurality of unclassified subsets through a self-encoding algorithm, wherein the unclassified sample set comprises a plurality of user financial data; performing a similarity comparison between a positive sample set and each of the plurality of unclassified subsets, wherein the positive sample set comprises a plurality of user financial data; and determining the unclassified subset as a positive sample subset or a negative sample subset according to the similarity comparison result.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly to reside in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. An unclassified sample processing method for a machine learning model, comprising the following steps:
dividing an unclassified sample set of a machine learning model into a plurality of unclassified subsets through a self-coding algorithm, wherein the unclassified sample set comprises a plurality of user financial data;
respectively performing similarity comparison on a positive sample set and each unclassified subset in the plurality of unclassified subsets, wherein the positive sample set comprises a plurality of user financial data; and
determining the unclassified subset as a positive sample subset or a negative sample subset according to the similarity comparison result.
2. The method of claim 1, further comprising:
after the similarity comparison is finished, generating positive sample financial data for the machine learning model through the positive sample set and at least one positive sample subset; and
negative sample financial data for the machine learning model is generated by at least one negative sample subset.
3. The method of any one of claims 1-2, wherein separating the set of unclassified samples of the machine learning model into a plurality of unclassified subsets by a self-encoding algorithm comprises:
inputting a plurality of user financial data in the set of unclassified samples of a machine learning model into a classification model, respectively;
the classification model performs encoding and decoding operations on each financial data in the plurality of user financial data to determine a classification label of the financial data; and
generating the plurality of unclassified subsets by user financial data having the same label;
wherein the classification model is generated by a self-encoding algorithm.
4. The method of any of claims 1-3, wherein the classification model performing an encoding decoding operation on each of the plurality of user financial data to determine its classification tag comprises:
the classification model carries out coding operation on each piece of user financial data in the plurality of pieces of user financial data to generate a feature code;
sequentially inputting the feature codes into a multilayer neural network structure of the classification model to carry out reconstruction processing to generate a low-dimensional feature value; and
comparing the low-dimensional feature values to the step feature values in the classification model to determine classification labels for the user financial data.
5. The method of any one of claims 1-4, wherein separating the set of unclassified samples of the machine learning model into a plurality of unclassified subsets by a self-encoding algorithm further comprises:
extracting part of user financial data from the positive sample set to generate training data; and
training a self-coding algorithm model with the training data to generate the classification model.
6. The method of any of claims 1-5, wherein extracting part of the user financial data from the positive sample set to generate training data comprises:
extracting part of the user financial data from the positive sample set; and
screening the multi-dimensional user characteristics of the user financial data to generate the training data.
7. The method of any of claims 1-6, wherein comparing the similarity of the positive sample set to each of the plurality of unclassified subsets comprises:
performing the similarity comparison between the positive sample set and each of the plurality of unclassified subsets through a minhash algorithm.
8. An unclassified sample processing apparatus of a machine learning model, comprising:
the classification module is used for dividing an unclassified sample set of the machine learning model into a plurality of unclassified subsets through a self-coding algorithm, wherein the unclassified sample set comprises a plurality of user financial data;
a comparison module, configured to perform similarity comparison between a positive sample set and each of the plurality of unclassified subsets, where the positive sample set includes a plurality of user financial data; and
an aggregation module for determining the unclassified subset as a positive sample subset or a negative sample subset according to the similarity comparison result.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201910921670.XA 2019-09-27 2019-09-27 Unclassified sample processing method and device of machine learning model and electronic equipment Pending CN110796171A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910921670.XA CN110796171A (en) 2019-09-27 2019-09-27 Unclassified sample processing method and device of machine learning model and electronic equipment


Publications (1)

Publication Number Publication Date
CN110796171A true CN110796171A (en) 2020-02-14

Family

ID=69439877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910921670.XA Pending CN110796171A (en) 2019-09-27 2019-09-27 Unclassified sample processing method and device of machine learning model and electronic equipment

Country Status (1)

Country Link
CN (1) CN110796171A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106681688A (en) * 2016-12-28 2017-05-17 北京酷云互动科技有限公司 Set similarity calculation method and system based on minhash
CN107273454A (en) * 2017-05-31 2017-10-20 北京京东尚科信息技术有限公司 User data sorting technique, device, server and computer-readable recording medium
CN109558873A (en) * 2018-12-03 2019-04-02 哈尔滨工业大学 A kind of mode identification method based on this stack autoencoder network that changes
CN110163261A (en) * 2019-04-28 2019-08-23 平安科技(深圳)有限公司 Unbalanced data disaggregated model training method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
裔阳 (Yi Yang) et al.: "Remote sensing image classification method based on positive samples and unlabeled samples" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738330A (en) * 2020-06-19 2020-10-02 电子科技大学中山学院 Intelligent automatic scoring method for hand-drawn copy works
CN112818347A (en) * 2021-02-22 2021-05-18 深信服科技股份有限公司 File label determination method, device, equipment and storage medium
CN112818347B (en) * 2021-02-22 2024-04-09 深信服科技股份有限公司 File tag determining method, device, equipment and storage medium
CN113420815A (en) * 2021-06-24 2021-09-21 江苏师范大学 Semi-supervised RSDAE nonlinear PLS intermittent process monitoring method
CN113420815B (en) * 2021-06-24 2024-04-30 江苏师范大学 Nonlinear PLS intermittent process monitoring method of semi-supervision RSDAE
CN114418752A (en) * 2022-03-28 2022-04-29 北京芯盾时代科技有限公司 Method and device for processing user data without type label, electronic equipment and medium
CN115270992A (en) * 2022-08-19 2022-11-01 牡丹江师范学院 Novel material physical data classification method and system

Similar Documents

Publication Publication Date Title
CN110796171A (en) Unclassified sample processing method and device of machine learning model and electronic equipment
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN107357874B (en) User classification method and device, electronic equipment and storage medium
CN110334186B (en) Data query method and device, computer equipment and computer readable storage medium
CN112633419A (en) Small sample learning method and device, electronic equipment and storage medium
CN110377744B (en) Public opinion classification method and device, storage medium and electronic equipment
CA2970159A1 (en) Technical and semantic signal processing in large, unstructured data fields
Jerzak et al. An improved method of automated nonparametric content analysis for social science
CN113946681B (en) Text data event extraction method and device, electronic equipment and readable medium
CN113177700B (en) Risk assessment method, system, electronic equipment and storage medium
CN111210335A (en) User risk identification method and device and electronic equipment
CN110796482A (en) Financial data classification method and device for machine learning model and electronic equipment
CN111199469A (en) User payment model generation method and device and electronic equipment
CN111191825A (en) User default prediction method and device and electronic equipment
CN112632226A (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN115329176A (en) Search request processing method and device, computer equipment and storage medium
CN113435499B (en) Label classification method, device, electronic equipment and storage medium
CN114428860A (en) Pre-hospital emergency case text recognition method and device, terminal and storage medium
CN110069558A (en) Data analysing method and terminal device based on deep learning
CN111241273A (en) Text data classification method and device, electronic equipment and computer readable medium
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN115952800A (en) Named entity recognition method and device, computer equipment and readable storage medium
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
CN112527851B (en) User characteristic data screening method and device and electronic equipment
CN114897099A (en) User classification method and device based on passenger group deviation smooth optimization and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination