CN112257818A - Sample data processing method and device and computer equipment

Info

Publication number: CN112257818A
Application number: CN202011513899.9A
Authority: CN
Inventors: 顾凌云; 谢旻旗; 段湾; 汪仁杰; 张涛; 潘峻
Original assignee: Shanghai IceKredit Inc
Current assignee: Shanghai IceKredit Inc
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2021-01-22
Anticipated expiration: 2040-12-21
Also published as: CN112257818B

Abstract

The sample data processing method, device and computer equipment provided by the embodiments of the invention first obtain sample data to be processed, determine a plurality of category labels, and divide the sample data into a plurality of sample subsets according to the category labels; then sort the category labels according to received service requirement information to obtain a category sorting queue; and finally assign sample weights, in turn, to each sample subset under each category label in the category sorting queue. Therefore, when the sample data to be processed involves category labels at multiple levels, the method can quickly obtain the sample weight of each sample subset, so that the total weights of the sample subsets at the same level are equal while the total sample weights across different levels remain consistent. The overall weight balance of the sample data is thus more accurate, and effective information is retained.

Description

Sample data processing method and device and computer equipment

Technical Field

The invention relates to the technical field of data processing, in particular to a sample data processing method and device and computer equipment.

Background

When a classification model is used for modeling, one problem that may arise is a high misclassification cost. For example, when classifying legal and illegal users, the cost of identifying an illegal user as legal is far higher than that of identifying a legal user as illegal. Another problem is that severe sample imbalance distorts the prediction result. For example, if only 1 of 10000 user samples is an illegal user and the other 9999 are legal users, directly predicting every sample as a legal user yields an accuracy of 99.99%, which is obviously meaningless.
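The 10000-user example can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original disclosure; the variable names and the use of recall as the second metric are assumptions made for illustration.

```python
# Hypothetical illustration of the 10000-user example.
n_total = 10_000
n_illegal = 1

# Ground truth: 1 = illegal user, 0 = legal user.
y_true = [1] * n_illegal + [0] * (n_total - n_illegal)
# A trivial "model" that predicts every user as legal.
y_pred = [0] * n_total

# Fraction of samples whose prediction matches the ground truth.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
# Fraction of illegal users that the model actually flags.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_illegal = caught / n_illegal

print(f"accuracy = {accuracy:.2%}")
print(f"recall on illegal users = {recall_illegal:.0%}")
```

The accuracy is high only because the legal class dominates; the single illegal user is never identified.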

Therefore, for samples with unbalanced categories, some method should be used to balance the weights among the categories; when the category weights are roughly uniform, the high misclassification cost and reduced predictive ability caused by unbalanced category weights can be avoided most effectively. Here, the weight of a category is the sum of the weights of all individual samples in that category. In practice, the modeling samples available in an actual risk control scenario are often unevenly distributed across categories, so the samples may not be an unbiased estimate of the actual population. For example, if the ratio of legal to illegal users in the actual business is 10:1 but the ratio in the modeling samples is 5:1, the sample distribution cannot represent the actual distribution, which degrades the predictive performance of the model.

Disclosure of Invention

In order to solve the above problems, the present invention provides a sample data processing method, device and computer equipment.

Based on the first aspect of the embodiments of the present invention, there is provided a sample data processing method applied to a computer device, the method including:

obtaining sample data to be processed, determining a plurality of category labels of the sample data to be processed, and dividing the sample data to be processed into a plurality of sample subsets according to the category labels, wherein each sample subset corresponds to a category label;

receiving service demand information, and sorting the plurality of category labels according to the service demand information to obtain a category sorting queue;

sequentially assigning a sample weight to each sample subset under each category label in the category sorting queue.

Optionally, sequentially assigning a sample weight to each sample subset under each category label in the category sorting queue, including:

determining an overall sample weight according to the total number of the sample subsets;

determining a current sample weight for the subset of samples under each category label based on the overall sample weight and the number of subsets of samples under each category label in the category ordering queue.

Optionally, determining a current sample weight of the subset of samples under each category label comprises:

determining a previous layer sample subset to which all sample subsets under each class label belong and acquiring the sample weight of the previous layer sample subset;

and determining the current sample weight of each sample subset in all the sample subsets under the class label according to the sample weight of the sample subset of the previous layer until the current sample weight of each sample subset under the last class label is determined.

Optionally, sorting the plurality of category labels according to the service requirement information to obtain a category sorting queue, including:

determining a requirement category list corresponding to service requirement information, and constructing a label characteristic list corresponding to the category label, wherein the requirement category list and the label characteristic list respectively comprise a plurality of list elements with different list event weights;

extracting the requirement sample data of any list element of the service requirement information in the requirement category list, and determining the list element with the minimum list event weight in the label characteristic list as a target list element;

mapping the required sample data to the target list element according to the sample data distribution diagram of the sample data to be processed, obtaining required mapping data in the target list element, and generating a correlation coefficient list between the service requirement information and the class label according to the required sample data and the required mapping data;

acquiring data to be associated in the target list element by taking the requirement mapping data as current sample data, matching the data to be associated to the list element where the requirement sample data is located according to a correlation matching path corresponding to the correlation coefficient list, obtaining target associated data corresponding to the data to be associated in the list element where the requirement sample data is located, and determining the target associated data as tag sorting reference data;

acquiring a mapping path track for mapping the demand sample data to the target list element; according to the data transmission defect rate between the target associated data and the mapping attribute data corresponding to the multiple path node units on the mapping path track, sequentially acquiring the sequencing reference results corresponding to the tag sequencing reference data layer by layer in the tag feature list according to the size sequence of the list event weight of the list element, stopping acquiring the sequencing reference result in the next list element until the sequencing confidence coefficient of the list element where the sequencing reference result is located is consistent with the sequencing confidence coefficient of the tag sequencing reference data in the requirement category list, and establishing a sequencing execution path between the tag sequencing reference data and the sequencing reference result acquired last time; and sorting the plurality of category labels based on the sorting execution path to obtain a category sorting queue.

Based on the second aspect of the embodiments of the present invention, there is provided a sample data processing apparatus applied to a computer device, the apparatus including:

the sample dividing module is used for acquiring sample data to be processed, determining a plurality of class labels of the sample data to be processed, and dividing the sample data to be processed into a plurality of sample subsets according to the class labels; wherein each sample subset corresponds to a category label;

the label sorting module is used for receiving the service requirement information and sorting the plurality of category labels according to the service requirement information to obtain a category sorting queue;

and the weight distribution module is used for sequentially distributing sample weight to each sample subset under each class label in the class sorting queue.

Optionally, the weight assignment module is configured to:

Optionally, the tag ordering module is configured to:

According to a third aspect of embodiments of the present invention, there is provided a computer device, comprising a processor and a memory, which are in communication with each other, the processor being configured to retrieve a computer program from the memory and to implement the method of the first aspect by running the computer program.

According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium on which a computer program is stored, the computer program implementing the method of the first aspect when run.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a diagram illustrating a conventional weight assignment.

Fig. 2 is a flowchart of a sample data processing method according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of the weight distribution corresponding to a sample data processing method according to an embodiment of the present invention.

Detailed Description

For a better understanding of the technical solutions of the present invention, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present invention are detailed descriptions of the technical solutions, not limitations on them, and that the technical features in the embodiments and examples may be combined with each other where no conflict arises.

Through research, the inventors found that conventional methods for determining sample weights generally either derive the weights from the ratio of sample counts among classes, or equalize the total weight of each class by sampling in advance so that all classes contain the same number of samples. In the count-ratio approach, when classes A and B contain Na and Nb samples respectively, the per-sample weights of classes A and B are in the ratio Nb:Na; in practice, the weight of each sample in classes A and B can be taken as 1 and Na/Nb, or as Nb/Na and 1. In this way, the summed sample weights of the two classes are always equal, that is, the class weights are consistent. When no specific per-sample weight is specified, all samples are weighted equally by default. Therefore, direct sampling that makes the sample counts of the different classes equal necessarily makes their summed weights equal as well.
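The count-ratio weighting can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `ratio_weights` and the class sizes are hypothetical, and class A samples are arbitrarily fixed at weight 1 so that class B samples receive weight Na/Nb.

```python
from fractions import Fraction

def ratio_weights(n_a, n_b):
    """Per-sample weights (w_a, w_b) in the ratio Nb:Na, with w_a fixed at 1."""
    return Fraction(1), Fraction(n_a, n_b)

n_a, n_b = 9000, 1000          # hypothetical class sizes Na and Nb
w_a, w_b = ratio_weights(n_a, n_b)

total_a = n_a * w_a            # summed weight of class A
total_b = n_b * w_b            # summed weight of class B

print(w_a, w_b)                # per-sample weights
print(total_a, total_b)        # the two class totals match
```

Exact fractions are used instead of floats so that the equality of the two class totals can be checked without rounding error.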

However, when the samples involve classes at multiple levels, the service requirement generally calls for keeping the sample weights balanced under each class. For example, Fig. 1 shows a sample structure with two levels of classes: class A divides the entire data set into two subsets, A1 and A2; class B further divides the A1 subset into two subsets, B1 and B2; and under class B, all of the A2 subset belongs to B3. If the overall sample is loan data, class A may be the division of customers into good and bad, and class B may further be the customer's income level interval. It is necessary to ensure that the total weights of A1 and A2 are equal and that the total weights of B1 and B2 are equal. The sample weights in A1 and A2, and likewise those in B1 and B2, can each be obtained from the count ratio, but this cannot coordinate the weight relationship between the data sets divided by A and those divided by B so that the total sample weights of adjacent levels are consistent, that is, so that the total weight of A1 equals the combined total weight of B1 and B2.

Making the sample counts of the different classes equal through sampling does directly balance the total weight of each class, but the sample counts must be specified in advance, and the samples added or deleted during sampling increase the uncertainty of the model's performance. With multi-level classes, sampling the terminal classes also becomes more complex. Still taking Fig. 1 as an example: on the surface, direct sampling only needs to make the counts of A1 and A2 equal and the counts of B1 and B2 equal, but in the end the sizes of the B1, B2 and B3 sample sets must be considered together to devise a sampling scheme that minimizes the information lost by adding and deleting samples. A terminal sample is a sample under the lowest-level class; class B in Fig. 1 is the terminal class.

In summary, for samples involving multi-level classes, weights derived from the count ratio can hardly keep the weights consistent across levels, direct sampling loses information through the addition and deletion of samples, and the sampling method itself is complex and uncertain. The invention discloses a method for determining sample weights that is essentially based on the sample count ratio and improves on it, so that when the samples involve multi-level classes, the sample weights of the terminal classes can be obtained quickly while the weights of the sample sets of the classes at each level remain balanced.

Referring to fig. 2, a flowchart of a sample data processing method, which may be applied to a computer device, is shown, and the method specifically includes the following steps.

Step S21, obtaining sample data to be processed, determining a plurality of class labels of the sample data to be processed, and dividing the sample data to be processed into a plurality of sample subsets according to the class labels.

Wherein each sample subset corresponds to a category label.
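The division in step S21 can be pictured as a simple grouping of records by the value of a category label. The sketch below is illustrative only: the record layout, the field names `quality` and `income`, and the helper `split_by_label` are assumptions, not part of the disclosure.

```python
from collections import defaultdict

# Hypothetical loan records; the field names are assumptions.
samples = [
    {"id": 1, "quality": "good", "income": "low"},
    {"id": 2, "quality": "good", "income": "high"},
    {"id": 3, "quality": "bad", "income": "low"},
    {"id": 4, "quality": "good", "income": "low"},
]

def split_by_label(records, label):
    """Group records into sample subsets keyed by the value of one category label."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[label]].append(record)
    return dict(subsets)

# First-level division by customer quality; each subset can be split again by "income".
by_quality = split_by_label(samples, "quality")
good_by_income = split_by_label(by_quality["good"], "income")
```

Applying the same helper to each resulting subset with the next label yields the multi-level structure that the later weighting steps operate on.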

Step S22, receiving service requirement information, and sorting the plurality of category labels according to the service requirement information to obtain a category sorting queue.

And step S23, sequentially assigning a sample weight to each sample subset under each class label in the class sorting queue.

It can be understood that, based on steps S21 to S23, sample data to be processed is first obtained, a plurality of category labels are determined, and the sample data is divided into a plurality of sample subsets according to the category labels; the category labels are then sorted according to the received service requirement information to obtain a category sorting queue; and finally a sample weight is assigned, in turn, to each sample subset under each category label in the category sorting queue. Therefore, when the sample data to be processed involves category labels at multiple levels, the method can quickly obtain the sample weight of each sample subset, so that the total weights of the sample subsets at the same level are equal while the total sample weights across different levels remain consistent. The overall weight balance of the sample data is thus more accurate, and effective information is retained.

In an alternative embodiment, sequentially assigning a sample weight to each sample subset under each category label in the category sorting queue, as described in step S23, specifically includes the following steps S231 and S232.

Step S231, determining the overall sample weight according to the total number of the sample subsets.

Step S232, determining the current sample weight of the sample subset under each category label based on the overall sample weight and the number of the sample subsets under each category label in the category sorting queue.

Further, the determining of the current sample weight of the sample subset under each class label described in step S232 may specifically include the following contents described in step S2321 and step S2322.

Step S2321, for each class label, determine a previous layer sample subset to which all sample subsets under the class label belong, and obtain a sample weight of the previous layer sample subset.

Step S2322, determining the current sample weight of each sample subset in all the sample subsets under the category label according to the sample weight of the previous layer sample subset until determining the current sample weight of each sample subset under the last category label.

For convenience of description of step S23, the following description is made with reference to fig. 3.

(1) The computer device initializes the weight of the overall sample. To prevent the terminal sample weights from becoming too small, the overall weight is generally initialized to W, the total number of samples.

(2) The computer device loops over the levels and obtains the number of classes at each level; the total weight of the sample set of each class at a level equals the total weight of its parent class at the previous level divided by the number of classes at that level under that parent class. Taking the multi-level class sample structure of Fig. 3 as an example, the number of A classes is 2, so the total weight of each of the A1 and A2 sample sets is W/2. Similarly, the number of B classes under A1 is 2, so the total weight of each B-class sample set under A1 is (W/2)/2; the number of B classes under A2 is 3, so the total weight of each of the three B-class sample sets under A2 is (W/2)/3. In this example, the next-level classes of both A1 and A2 include B1, but the two B1 classes should be treated as two different classes, and the same applies to B2. The total weight of each sample set of the terminal class C is obtained in the same way.

(3) Once the computer device has obtained the total weight of each terminal class, the weight of a single sample under each terminal class is the total weight of that class divided by the number of samples in that class.
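Steps (1) to (3) can be sketched as a short recursive routine. This is a minimal illustration, not the patented implementation: the nested-dictionary representation, the function name `assign_weights`, and the sample counts are all assumptions, only two levels are shown although deeper nesting is handled the same way, and every terminal class is assumed to contain at least one sample.

```python
from fractions import Fraction

def assign_weights(tree, total_weight):
    """Return {path: (subset_total_weight, per_sample_weight)} for terminal classes.

    `tree` maps a class name either to a nested dict (next-level classes)
    or to an int (the number of samples in a terminal class).
    """
    result = {}

    def walk(node, weight, path):
        # Step (2): parent total divided by the number of classes under that parent.
        share = Fraction(weight) / len(node)
        for name, child in node.items():
            child_path = path + (name,)
            if isinstance(child, dict):
                walk(child, share, child_path)
            else:
                # Step (3): terminal class total divided by its sample count.
                result[child_path] = (share, share / child)

    walk(tree, total_weight, ())
    return result

# Hypothetical sample counts laid out like the A and B levels described above.
tree = {
    "A1": {"B1": 300, "B2": 100},
    "A2": {"B1": 200, "B2": 250, "B3": 150},
}
# Step (1): initialize the overall weight W to the total number of samples.
W = sum(n for sub in tree.values() for n in sub.values())
weights = assign_weights(tree, W)

for path, (subset_total, per_sample) in weights.items():
    print("/".join(path), subset_total, per_sample)
```

Keying the result by the full path (for example `("A1", "B1")` versus `("A2", "B1")`) reflects the point that the B1 classes under A1 and A2 are treated as different classes.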

Balancing the total sample weights at each level means that the next-level subclasses divided out of the same previous-level sample set have equal total weights; it does not mean that the class totals obtained by dividing the entire sample by each level's class are equal. Still referring to the example of Fig. 3, suppose instead that the total weights of the classes were made equal under a division of the entire sample. Then the total weights of A1 and A2 under class A are equal, both W/2, and the total weights over the entire sample of B1, B2 and B3 under class B are equal, all W/3. Since B1 and B2 each appear under both A1 and A2, and the total weights of the A1 and A2 samples are equal, the total weight of the B1 sample set under A1 (abbreviated A1B1) should be 1/2 of the total weight of B1, that is, (W/3)/2. Similarly, the total weights of A1B2, A2B1 and A2B2 are each (W/3)/2, and only the total weight of A2B3 is W/3. In that case the total weights of A1B1 and A1B2 sum to W/3, while those of A2B1, A2B2 and A2B3 sum to 2W/3, which is inconsistent with the total weights of the sample sets divided by class A. Therefore, sample balance at each level is limited to balancing the total weights of the sample subsets within each subset produced by the previous level's division. In this example, only the total weights of B1 and B2 are balanced within the A1 subset, and only those of B1, B2 and B3 within the A2 subset.
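The inconsistency in this counter-example can be verified with exact arithmetic. The sketch below is illustrative only; normalizing W to 1 and the variable names are assumptions.

```python
from fractions import Fraction

W = Fraction(1)  # overall weight, normalized to 1 for illustration

# Whole-sample balancing: each of B1, B2 and B3 gets W/3 over the entire sample.
b_total = W / 3
# B1 and B2 each appear under both A1 and A2, so each is split in half.
a1b1 = a1b2 = a2b1 = a2b2 = b_total / 2
# B3 appears only under A2 and keeps its full share.
a2b3 = b_total

a1_total = a1b1 + a1b2           # sums to W/3
a2_total = a2b1 + a2b2 + a2b3    # sums to 2W/3

print(a1_total, a2_total)        # neither equals W/2, the class A balance target
```

The two totals differ from each other and from W/2, which is why balancing is applied only among the subclasses of the same parent subset.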

In a possible implementation manner, the sorting the plurality of category labels according to the service requirement information described in step S22 to obtain a category sorting queue may be implemented in step S220.

Step S220, respectively determining a requirement category list corresponding to the service requirement information and a tag feature list corresponding to the category tag, determining a sorting execution path according to the requirement category list and the tag feature list, and sorting the category tags through the sorting execution path to obtain a category sorting queue.

For some further embodiments, the content described in step S220 may further include the content described in the following steps S221 to S225.

Step S221, determining a requirement category list corresponding to the service requirement information, and constructing a label characteristic list corresponding to the category label, wherein the requirement category list and the label characteristic list respectively comprise a plurality of list elements with different list event weights.

Step S222, extracting the demand sample data of any list element of the service demand information in the demand category list, and determining the list element with the minimum list event weight in the tag feature list as a target list element.

Step S223, mapping the requirement sample data to the target list element according to the sample data distribution map of the sample data to be processed, obtaining requirement mapping data in the target list element, and generating a correlation coefficient list between the service requirement information and the category label according to the requirement sample data and the requirement mapping data.

Step S224, obtaining data to be associated in the target list element by taking the requirement mapping data as the current sample data, matching the data to be associated to the list element where the requirement sample data is located according to the correlation matching path corresponding to the correlation coefficient list, obtaining the target associated data corresponding to the data to be associated in the list element where the requirement sample data is located, and determining the target associated data as the label sorting reference data.

Step S225, obtaining a mapping path track for mapping the demand sample data to the target list element; according to the data transmission defect rate between the target associated data and the mapping attribute data corresponding to the multiple path node units on the mapping path track, sequentially acquiring the sequencing reference results corresponding to the tag sequencing reference data layer by layer in the tag feature list according to the size sequence of the list event weight of the list element, stopping acquiring the sequencing reference result in the next list element until the sequencing confidence coefficient of the list element where the sequencing reference result is located is consistent with the sequencing confidence coefficient of the tag sequencing reference data in the requirement category list, and establishing a sequencing execution path between the tag sequencing reference data and the sequencing reference result acquired last time; and sorting the plurality of category labels based on the sorting execution path to obtain a category sorting queue.

It can be understood that, based on steps S221 to S225, the requirement category list corresponding to the service requirement information and the tag feature list corresponding to the category tags can each be obtained, so that list elements are mapped onto each other based on the two lists. The correlation coefficient list is then determined, followed by the sorting reference results in the different list elements. The sorting confidence of the sorting reference results in different list elements can thus be taken into account, ensuring that the sorting confidence of the final sorting reference result meets the service requirement and that the actual service requirement is fully considered when the category tags are sorted.

Based on the same inventive concept, there is provided a sample data processing apparatus applied to a computer device, the apparatus comprising:

Optionally, the weight assignment module is configured to:

Optionally, the tag ordering module is configured to:

On the basis of the above, there is provided a computer device comprising a processor and a memory communicating with each other, the processor being configured to retrieve a computer program from the memory and to implement the above-mentioned method by running the computer program.

On the basis of the above, a computer-readable storage medium is provided, on which a computer program is stored, which computer program realizes the above-described method when executed.

It can be understood that the above scheme addresses problems such as inaccurate prediction results and high misclassification cost caused, in actual services, by unbalanced sample classes and uneven sample counts across classes. When the samples involve classes at multiple levels, the scheme can quickly calculate the weight of a single sample, so that the total weights of the classes at the same level are equal and the total sample weights of classes at different levels are consistent.

Furthermore, the above scheme requires no sampling, so none of the complications of sampling arise, and the calculation involves only a simple loop, making the method simpler and more efficient and allowing the weight of each sample to be obtained quickly. Because no sampling is required, no sample information is lost through adding or deleting samples, and no uncertainty is added on that account. Instead, all samples are retained, and the weight of each level's classes is obtained in a loop by sorting the classes from large to small, so that the weights of the classes at the same level are balanced and the total weights are consistent across levels. This makes the overall weight balance more accurate while retaining effective information.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A sample data processing method is applied to a computer device, and comprises the following steps:

2. The method of claim 1, wherein assigning a sample weight to each subset of samples under each class label in the class ordering queue in turn comprises:

3. The method of claim 2, wherein determining a current sample weight for the subset of samples under each class label comprises:

4. The method according to any one of claims 1 to 3, wherein sorting the plurality of category labels according to the service requirement information to obtain a category sorting queue comprises:

5. A sample data processing device applied to a computer device, the device comprising:

6. The apparatus of claim 5, wherein the weight assignment module is configured to:

7. The apparatus of claim 6, wherein the weight assignment module is configured to:

8. The apparatus of any of claims 5-7, wherein the tag ordering module is configured to:

9. A computer device comprising a processor and a memory in communication with each other, the processor being configured to retrieve a computer program from the memory and to implement the method of any one of claims 1-4 by running the computer program.

10. A computer-readable storage medium, on which a computer program is stored which, when executed, implements the method of any of claims 1-4.