CN113610168B

CN113610168B - Data processing method, device, equipment and medium

Info

Publication number: CN113610168B
Application number: CN202110918566.2A
Authority: CN
Inventors: 董萍
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-08-11
Filing date: 2021-08-11
Publication date: 2024-05-14
Anticipated expiration: 2041-08-11
Also published as: CN113610168A

Abstract

The embodiment of the application discloses a data processing method, a device, equipment and a medium. Wherein the method comprises the following steps: acquiring an attribute information set of a sample user and characteristic attributes of attribute information in the attribute information set; identifying first attribute information belonging to continuous variables and second attribute information belonging to discrete variables from the attribute information set according to characteristic attributes of the attribute information in the attribute information set; acquiring the statistical characteristics of the first attribute information, and grouping the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables; determining a correlation between the first attribute information and the second attribute information using the plurality of sets of variables and the variables of the second attribute information; and screening attribute information with validity from the first attribute information and the second attribute information based on the correlation determination. By adopting the application, the learning efficiency of the attribute information can be improved, and the resources can be saved.

Description

Data processing method, device, equipment and medium

Technical Field

The present application relates to the field of internet technologies, and in particular, to a data processing method, apparatus, device, and medium.

Background

With the development of big data, attribute information of users (such as age and sex) is currently learned mainly through an identification model in order to select suitable staff for enterprises. However, there are many kinds of attribute information, and learning all of it is time-consuming and wastes resources. For example, during learning it is found that the information obtained from some attribute information belonging to continuous variables is highly similar to the information obtained from some attribute information belonging to discrete variables. Learning both is equivalent to learning the same attribute information repeatedly, which consumes additional time and resources.

Disclosure of Invention

The embodiment of the application provides a data processing method, a device, equipment and a medium, which can improve the learning efficiency of attribute information and save resources.

In a first aspect, an embodiment of the present application provides a data processing method, including:

Acquiring an attribute information set of a sample user and characteristic attributes of attribute information in the attribute information set; the attribute information set comprises various attribute information of the sample user; identifying first attribute information belonging to continuous variables and second attribute information belonging to discrete variables from the attribute information set according to characteristic attributes of the attribute information in the attribute information set;

Acquiring the statistical characteristics of the first attribute information, and grouping the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables; determining a correlation between the first attribute information and the second attribute information using the plurality of sets of variables and the variables of the second attribute information;

And screening attribute information with validity from the first attribute information and the second attribute information based on the correlation determination.

Optionally, the statistical feature of the first attribute information includes a variable change rate, and the acquiring the statistical feature of the first attribute information and grouping the variables of the first attribute information according to the statistical feature to obtain a plurality of groups of variables includes:

Acquiring a difference value between every two variables belonging to the first attribute information and the number of variables belonging to the first attribute information;

Determining a variable change rate of the first attribute information based on a difference value between every two variables belonging to the first attribute information and the number of variables belonging to the first attribute information;

If the variable change rate of the first attribute information is larger than a change rate threshold, grouping the variables of the first attribute information by adopting an equidistant grouping strategy to obtain a plurality of groups of variables;

And if the variable change rate of the first attribute information is smaller than or equal to the change rate threshold, grouping the variables of the first attribute information by adopting an equal frequency grouping strategy to obtain a plurality of groups of variables.

Optionally, the statistical feature of the first attribute information includes a variable difference, and the acquiring the statistical feature of the first attribute information and grouping the variables of the first attribute information according to the statistical feature to obtain a plurality of groups of variables includes:

Extracting a maximum variable and a minimum variable from variables belonging to the first attribute information;

determining a difference between the maximum variable and the minimum variable as a variable difference;

If the variable difference value is larger than a difference value threshold value, grouping the variables of the first attribute information by adopting an equidistant grouping strategy to obtain a plurality of groups of variables;

and if the variable difference value is smaller than or equal to the difference value threshold value, grouping the variables of the first attribute information by adopting an equal frequency grouping strategy to obtain a plurality of groups of variables.

Optionally, the grouping processing is performed on the variables of the first attribute information by using an equidistant grouping policy to obtain multiple groups of variables, including:

sorting the variables of the first attribute information to obtain sorted variables;

grouping the ordered variables according to the difference value between the maximum variable and the minimum variable in the first attribute information to obtain a plurality of groups of variables; the difference value between the maximum variable and the minimum variable in each group of variables in the multiple groups of variables is in a difference value range;

The grouping processing is performed on the variables of the first attribute information by adopting an equal frequency grouping strategy to obtain a plurality of groups of variables, including:

grouping the ordered variables according to the variable quantity of the first attribute information to obtain a plurality of groups of variables; the number of variables in each of the plurality of sets of variables is the same.

Optionally, the determining the correlation between the first attribute information and the second attribute information using the multiple sets of variables and the variables of the second attribute information includes:

calculating a total error and a standard deviation between the plurality of groups of variables and the variables of the second attribute information;

determining a ratio between the total error and the standard deviation as the correlation between the first attribute information and the second attribute information.

Optionally, the determining the correlation between the first attribute information and the second attribute information using the plurality of groups of variables and the variables of the second attribute information includes:

acquiring grade errors between the plurality of groups of variables and the variables of the second attribute information, and the number of variables in the second attribute information; the number of variables of the second attribute information is the same as the number of groups in the plurality of groups of variables;

raising the grade error to a power to obtain a first value, and calculating the difference between the third power of the number of variables of the second attribute information and the number of variables of the second attribute information to obtain a second value;

And determining the correlation between the first attribute information and the second attribute information according to the ratio between the first value and the second value.
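The ratio described above, a powered grade error divided by the third power of the variable count minus the variable count, resembles the standard Spearman rank correlation coefficient. The following is a minimal sketch under that reading. The text does not state the exponent, the summation over grade errors, the factor 6, or the subtraction from 1; all four are assumptions taken from the standard Spearman formula, and the function name `rank_correlation` is hypothetical.

```python
from typing import Sequence


def rank_correlation(group_ranks: Sequence[float],
                     variable_ranks: Sequence[float]) -> float:
    """Spearman-style correlation from two rank (grade) sequences.

    Assumption: the grade error is the per-position rank difference,
    it is squared and summed, and the result is 1 - 6 * first / second.
    """
    n = len(variable_ranks)
    if n != len(group_ranks) or n < 2:
        raise ValueError("both rank sequences need the same length (>= 2)")
    # First value: the summed, squared grade errors (assumed exponent 2).
    first_value = sum((a - b) ** 2 for a, b in zip(group_ranks, variable_ranks))
    # Second value: third power of the variable count minus the count.
    second_value = n ** 3 - n
    return 1 - 6 * first_value / second_value


same_order = rank_correlation([1, 2, 3], [1, 2, 3])
reverse_order = rank_correlation([1, 2, 3], [3, 2, 1])
```

With identical rankings the grade errors are all zero and the result is 1; with fully reversed rankings of three items the result is -1.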

Optionally, the screening attribute information with validity from the first attribute information and the second attribute information based on the correlation includes:

If the correlation is smaller than a correlation threshold, the first attribute information and the second attribute information are used as attribute information with validity;

if the correlation is greater than or equal to a correlation threshold, the first attribute information is used as attribute information with validity; or the second attribute information is used as attribute information with validity.
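The screening rule above can be sketched as follows. This is an illustrative sketch only: the function name `screen_valid_attributes` and the `keep` parameter, which selects which attribute survives when the correlation reaches the threshold, are hypothetical, since the text only says that either one may be kept.

```python
from typing import List


def screen_valid_attributes(first_attribute: str,
                            second_attribute: str,
                            correlation: float,
                            correlation_threshold: float,
                            keep: str = "first") -> List[str]:
    """Return the attribute names regarded as having validity."""
    if correlation < correlation_threshold:
        # Weakly correlated: both attributes carry distinct information.
        return [first_attribute, second_attribute]
    # Strongly correlated: keep only one of the two (caller's choice).
    return [first_attribute] if keep == "first" else [second_attribute]


both_kept = screen_valid_attributes("age", "marital_status", 0.2, 0.8)
one_kept = screen_valid_attributes("age", "marital_status", 0.9, 0.8)
```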

In a second aspect, an embodiment of the present application provides a data processing apparatus, including:

The acquisition module is used for acquiring the attribute information set of the sample user and the characteristic attribute of the attribute information in the attribute information set; the identification module is used for identifying first attribute information belonging to continuous variables and second attribute information belonging to discrete variables from the attribute information set according to the characteristic attributes of the attribute information in the attribute information set;

The grouping module is used for acquiring the statistical characteristics of the first attribute information, and grouping the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables; a determining module, configured to determine a correlation between the first attribute information and the second attribute information using the plurality of sets of variables and the variables of the second attribute information;

And the screening module is used for screening the attribute information with validity from the first attribute information and the second attribute information based on the correlation determination.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor adapted to implement one or more instructions; and

A computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the steps of:

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium comprising: the computer storage medium stores one or more instructions adapted to be loaded by a processor and to perform the steps of:

In the application, the electronic device can acquire the attribute information set of the sample user and the characteristic attributes of the attribute information in the attribute information set, and identify, according to those characteristic attributes, the first attribute information belonging to continuous variables and the second attribute information belonging to discrete variables from the attribute information set. The electronic device then acquires the statistical characteristics of the first attribute information and groups the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables. Next, the electronic device may determine a correlation between the first attribute information and the second attribute information using the plurality of groups of variables and the variables of the second attribute information, and screen attribute information with validity from the first attribute information and the second attribute information based on the correlation. Because the first attribute information and the second attribute information are screened according to the correlation, repeated learning of the same variable can be avoided, the learning efficiency of the attribute information can be improved, and resources can be saved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to another embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present application. The method is performed by an electronic device, which may be a server or a terminal device. For example, the electronic device may be a stand-alone server, a server cluster formed by a plurality of servers, a cloud computing center, a tablet computer, a notebook computer, a palmtop computer, a smart audio device, a mobile internet device (MID), or the like. As shown in FIG. 1, the data processing method includes the following steps S101 to S105.

S101, acquiring an attribute information set of a sample user and characteristic attributes of attribute information in the attribute information set; the attribute information set comprises various attribute information of the sample user; the characteristic attribute of the attribute information in the attribute information set includes at least one of a data type of the attribute information and a variable belonging to the attribute information.

The electronic device can download the attribute information set of the sample user from the network or acquire it from another electronic device. The attribute information set includes a plurality of pieces of attribute information of the sample user, for example one or more of the name, age, gender, place of business, educational background, income, work address, and the like of the sample user. Further, the electronic device may acquire the characteristic attributes of the attribute information in the attribute information set; the characteristic attribute of the attribute information includes at least one of a data type of the attribute information and a variable belonging to the attribute information. The data type of the attribute information may be a text type or a numeric type, and a variable belonging to the attribute information refers to data corresponding to the attribute information; for example, when the attribute information is age, 39 years old, 40 years old, 20 years old, and the like are variables of the attribute information.

S102, according to the characteristic attribute of the attribute information in the attribute information set, identifying first attribute information belonging to continuous variables and second attribute information belonging to discrete variables from the attribute information set.

The electronic device may identify, from the attribute information set, first attribute information belonging to continuous variables and second attribute information belonging to discrete variables based on the characteristic attributes of the attribute information in the attribute information set. For example, the first attribute information may include age, income, and the like; the second attribute information may include gender, name, place of business, educational background, work address, and the like.

Optionally, continuous attribute information is generally of a numeric type, while discrete attribute information is generally of a text type. Thus, when the characteristic attribute of the attribute information includes the data type, and the data type is either a text type or a numeric type: if the attribute information is of the numeric type, the attribute information is determined as first attribute information belonging to continuous variables; if the attribute information is of the text type, the attribute information is determined as second attribute information belonging to discrete variables.

Optionally, continuous attribute information contains a relatively large number of variables, while discrete attribute information contains a relatively small number of variables. For example, age may take the values 20, 30, 40, and so on, whereas marital status takes only three values: unmarried, married, and divorced. Thus, when the characteristic attribute of the attribute information includes the variables, the number of variables contained in each piece of attribute information is counted; attribute information in the attribute information set whose number of variables is greater than a quantity threshold is taken as continuous first attribute information, and attribute information whose number of variables is less than or equal to the quantity threshold is taken as discrete second attribute information.
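The two identification rules for step S102 (by data type, or by the number of variables) might be sketched as below. This is a minimal illustration: the function names, the dictionary layout of the attribute information set, and the sample values are assumptions made for the example and do not come from the application. Counting distinct values is one possible reading of "the number of variables".

```python
from typing import Dict, List, Sequence, Tuple


def is_continuous_by_type(values: Sequence[object]) -> bool:
    """Rule 1: numeric-typed attribute information is treated as continuous."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in values)


def is_continuous_by_count(values: Sequence[object], count_threshold: int) -> bool:
    """Rule 2: attribute information with more distinct variables than the
    quantity threshold is treated as continuous."""
    return len(set(values)) > count_threshold


def split_by_count(attribute_set: Dict[str, List[object]],
                   count_threshold: int) -> Tuple[List[str], List[str]]:
    """Return (first/continuous, second/discrete) attribute names using rule 2."""
    first, second = [], []
    for name, values in attribute_set.items():
        target = first if is_continuous_by_count(values, count_threshold) else second
        target.append(name)
    return first, second


# Hypothetical sample data.
sample = {
    "age": [18, 20, 24, 25, 29, 90],
    "marital_status": ["unmarried", "married", "divorced",
                       "married", "unmarried", "married"],
}
first_attrs, second_attrs = split_by_count(sample, count_threshold=3)
```

Here `age` has six distinct variables and is treated as continuous, while `marital_status` has three and is treated as discrete; the type-based rule gives the same split for this sample.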

S103, acquiring statistical characteristics of the first attribute information, and carrying out grouping processing on the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables; the statistical characteristic of the first attribute information includes at least one of a variable rate of change and a variable difference.

Because of the lack of a method for calculating the correlation between discrete attribute information and continuous attribute information, it is necessary to perform discretization processing on continuous attribute information. Specifically, the electronic device may obtain a statistical feature of the first attribute information, and perform grouping processing on the variables of the first attribute information according to the statistical feature to obtain a plurality of groups of variables.

Optionally, the statistical feature of the first attribute information includes a variable change rate, and the acquiring the statistical feature of the first attribute information and grouping the variables of the first attribute information according to the statistical feature to obtain a plurality of groups of variables includes: acquiring a difference between every two variables belonging to the first attribute information and the number of variables belonging to the first attribute information; determining a variable change rate of the first attribute information based on the difference between every two variables belonging to the first attribute information and the number of variables belonging to the first attribute information; if the variable change rate of the first attribute information is greater than a change rate threshold, grouping the variables of the first attribute information using an equidistant grouping strategy to obtain a plurality of groups of variables; and if the variable change rate of the first attribute information is less than or equal to the change rate threshold, grouping the variables of the first attribute information using an equal frequency grouping strategy to obtain a plurality of groups of variables.

When the statistical feature includes the variable change rate of the first attribute information, the electronic device can acquire the difference between every two variables in the first attribute information and the number of variables in the first attribute information; the ratio of the sum of the differences between every two variables to the number of variables is taken as the variable change rate of the first attribute information. If the variable change rate of the first attribute information is greater than the change rate threshold, the variables of the first attribute information are relatively scattered, and the variables of the first attribute information are grouped using an equidistant grouping strategy to obtain a plurality of groups of variables. If the variable change rate of the first attribute information is less than or equal to the change rate threshold, the variables of the first attribute information are relatively concentrated, and the variables of the first attribute information are grouped using an equal frequency grouping strategy to obtain a plurality of groups of variables. Grouping the variables of the first attribute information according to the variable change rate effectively avoids an imbalance in which some groups contain relatively many variables and other groups contain relatively few.
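A minimal sketch of the change-rate rule follows. The text does not say whether "every two variables" means all pairs or only adjacent ones, nor whether differences are signed; the sketch assumes absolute differences over all unordered pairs. The function names and the threshold values are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def variable_change_rate(values: Sequence[float]) -> float:
    """Sum of pairwise absolute differences divided by the number of variables.

    Assumption: all unordered pairs are used, with absolute differences.
    """
    if not values:
        raise ValueError("at least one variable is required")
    total_difference = sum(abs(a - b) for a, b in combinations(values, 2))
    return total_difference / len(values)


def choose_strategy_by_change_rate(values: Sequence[float],
                                   rate_threshold: float) -> str:
    """Scattered variables -> equidistant; concentrated -> equal frequency."""
    if variable_change_rate(values) > rate_threshold:
        return "equidistant"
    return "equal_frequency"


ages = [18, 20, 24, 25, 29, 90]
rate = variable_change_rate(ages)          # 388 / 6
strategy = choose_strategy_by_change_rate(ages, rate_threshold=10.0)
```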

Optionally, the statistical feature of the first attribute information includes a variable difference, and the acquiring the statistical feature of the first attribute information and grouping the variables of the first attribute information according to the statistical feature to obtain a plurality of groups of variables includes: extracting a maximum variable and a minimum variable from the variables belonging to the first attribute information; determining the difference between the maximum variable and the minimum variable as the variable difference; if the variable difference is greater than a difference threshold, grouping the variables of the first attribute information using an equidistant grouping strategy to obtain a plurality of groups of variables; and if the variable difference is less than or equal to the difference threshold, grouping the variables of the first attribute information using an equal frequency grouping strategy to obtain a plurality of groups of variables.

When the number of variables in the first attribute information is fixed, a larger variable difference indicates that the variables of the first attribute information are more scattered, and a smaller variable difference indicates that they are more concentrated; therefore, the electronic device can group the variables of the first attribute information according to the variable difference. Specifically, the electronic device may extract a maximum variable and a minimum variable from the variables belonging to the first attribute information, and determine the difference between the maximum variable and the minimum variable as the variable difference. If the variable difference is greater than a difference threshold, the variables of the first attribute information are relatively scattered, and the variables of the first attribute information are grouped using an equidistant grouping strategy to obtain a plurality of groups of variables. If the variable difference is less than or equal to the difference threshold, the variables of the first attribute information are relatively concentrated, and the variables of the first attribute information are grouped using an equal frequency grouping strategy to obtain a plurality of groups of variables. Grouping the variables of the first attribute information according to the variable difference effectively avoids an imbalance in which some groups contain relatively many variables and other groups contain relatively few.
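The variable-difference rule can be sketched in the same way. The function names and the threshold values below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def variable_difference(values: Sequence[float]) -> float:
    """Difference between the maximum variable and the minimum variable."""
    if not values:
        raise ValueError("at least one variable is required")
    return max(values) - min(values)


def choose_strategy_by_difference(values: Sequence[float],
                                  difference_threshold: float) -> str:
    """Large spread -> equidistant grouping; small spread -> equal frequency."""
    if variable_difference(values) > difference_threshold:
        return "equidistant"
    return "equal_frequency"


ages = [18, 20, 24, 25, 29, 90]
spread = variable_difference(ages)  # 90 - 18
chosen = choose_strategy_by_difference(ages, difference_threshold=50)
```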

Optionally, the grouping processing is performed on the variables of the first attribute information by using an equidistant grouping policy to obtain multiple groups of variables, including: sorting the variables of the first attribute information to obtain sorted variables; grouping the ordered variables according to the difference value between the maximum variable and the minimum variable in the first attribute information to obtain a plurality of groups of variables; the difference between the largest variable and the smallest variable in each of the plurality of sets of variables is within a range of differences.

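The electronic device can group the variables of the first attribute information using an equidistant grouping strategy to obtain a plurality of groups of variables. Specifically, the electronic device may sort the variables of the first attribute information in ascending or descending order to obtain the sorted variables, and then group the sorted variables according to the difference between the maximum variable and the minimum variable in the first attribute information, so that the difference between the maximum variable and the minimum variable in each group falls within a difference range.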

Optionally, the grouping processing is performed on the variables of the first attribute information by using an equal frequency grouping policy to obtain multiple groups of variables, including: sorting the variables of the first attribute information to obtain sorted variables; grouping the ordered variables according to the variable quantity of the first attribute information to obtain a plurality of groups of variables; the number of variables in each of the plurality of sets of variables is the same.

For example, the first attribute information is age, with the variables 18, 20, 24, 25, 29, and 90. If an equal frequency grouping strategy is adopted, the variables are sorted to obtain 18, 20, 24, 25, 29, 90 and divided in sorted order into three groups of two variables each, namely [18, 20], [24, 25], and [29, 90]; the number of groups can be determined according to the data volume of the first attribute information. If an equidistant grouping strategy is adopted, the maximum variable and the minimum variable in the first attribute information are obtained and the difference between them is calculated. If the first attribute information is divided into 3 groups, the width of each group is (90-18)/3 = 24, so the group intervals are [18, 42), [42, 66), and [66, 90]. The variables belonging to [18, 42) are 18, 20, 24, 25, and 29; no variable belongs to [42, 66); and the variables belonging to [66, 90] include 90.
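The age example above can be reproduced with the following sketch. The function names are hypothetical. Two details are assumptions: when the variable count is not divisible by the group count, the extra variables are given to the leading groups, and the maximum variable is placed in the last (closed) interval.

```python
from typing import List, Sequence


def equal_frequency_groups(values: Sequence[float],
                           group_count: int) -> List[List[float]]:
    """Sort the variables and cut them into groups of (nearly) equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), group_count)
    groups, start = [], 0
    for index in range(group_count):
        # Assumption: leading groups absorb any remainder.
        end = start + size + (1 if index < remainder else 0)
        groups.append(ordered[start:end])
        start = end
    return groups


def equidistant_groups(values: Sequence[float],
                       group_count: int) -> List[List[float]]:
    """Sort the variables and assign them to equal-width intervals."""
    ordered = sorted(values)
    low, high = ordered[0], ordered[-1]
    width = (high - low) / group_count
    groups: List[List[float]] = [[] for _ in range(group_count)]
    for value in ordered:
        if width == 0:
            index = 0
        else:
            # The maximum falls on the upper edge; clamp it into the last group.
            index = min(int((value - low) / width), group_count - 1)
        groups[index].append(value)
    return groups


ages = [18, 20, 24, 25, 29, 90]
by_frequency = equal_frequency_groups(ages, 3)
by_distance = equidistant_groups(ages, 3)
```

For this input the equal frequency result is `[[18, 20], [24, 25], [29, 90]]` and the equidistant result is `[[18, 20, 24, 25, 29], [], [90]]`, matching the intervals [18, 42), [42, 66), and [66, 90].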

S104, determining a correlation between the first attribute information and the second attribute information using the plurality of groups of variables and the variables of the second attribute information.

S105, screening attribute information with validity from the first attribute information and the second attribute information based on the correlation.

In step S104 and step S105, the electronic device may determine a correlation between the first attribute information and the second attribute information using the plurality of groups of variables and the variables of the second attribute information. The correlation reflects the similarity between the first attribute information and the second attribute information: the higher the correlation, the higher the similarity between them, and the lower the correlation, the lower the similarity. The information provided by two pieces of attribute information with a high correlation is highly similar, that is, the two pieces of attribute information are equivalent to the same variable; therefore, learning both of them is equivalent to learning the same variable repeatedly, which wastes resources. On this basis, the electronic device can screen attribute information with validity from the first attribute information and the second attribute information based on the correlation. Screening the first attribute information and the second attribute information in this way avoids repeated learning of the same variable, improves the learning efficiency of the attribute information, and saves resources.

FIG. 2 is a schematic flowchart of another data processing method provided by an embodiment of the present application. The method is performed by an electronic device and includes the following steps S201 to S206.

S201, acquiring an attribute information set of a sample user and characteristic attributes of attribute information in the attribute information set; the attribute information set comprises various attribute information of the sample user; the characteristic attribute of the attribute information in the attribute information set includes at least one of a data type of the attribute information and a variable belonging to the attribute information.

S202, according to the characteristic attribute of the attribute information in the attribute information set, identifying first attribute information belonging to continuous variables and second attribute information belonging to discrete variables from the attribute information set.

S203, acquiring the statistical characteristics of the first attribute information, and carrying out grouping processing on the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables; the statistical characteristic of the first attribute information includes at least one of a variable rate of change and a variable difference.

S204, determining the correlation between the first attribute information and the second attribute information by adopting the variable groups and the variable of the second attribute information. Optionally, the determining the correlation between the first attribute information and the second attribute information using the multiple sets of variables and the variables of the second attribute information includes: calculating the total error between the plurality of sets of variables and the variables of the second attribute information, and standard deviation; a ratio between the overall error and the standard deviation is determined as a correlation between the first attribute information and the second attribute information. For example, the electronic device may calculate the correlation between the first attribute information and the second attribute information using the following formula (1).

In formula (1), cov(X1, X2) represents the overall error (covariance) between the plurality of sets of variables and the variables of the second attribute information, and σ_X1 σ_X2 represents the product of the standard deviations of the plurality of sets of variables and of the variables of the second attribute information. X1 represents the representative value of any one of the plurality of sets of variables, and X2 represents any one of the variables in the second attribute information. The representative value of each set of variables may be the average of the variables in the set, or a value randomly selected from the set; ρ_X1,X2 represents the correlation between any one set of variables and any one variable in the second attribute information. Further, the correlation between the first attribute information and the second attribute information may be determined from the correlation between any one set of variables and any one variable in the second attribute information.
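The ratio described for formula (1) can be sketched as below. This is a minimal sketch, assuming each set of variables is represented by its average and the discrete variables have been given numeric codes; the function name `pearson` and the sample values are hypothetical.

```python
import math


def pearson(x1, x2):
    """Return cov(x1, x2) / (std(x1) * std(x2)) using population statistics."""
    if len(x1) != len(x2) or len(x1) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    # Overall error (covariance) between the two sequences.
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / n
    std1 = math.sqrt(sum((a - mean1) ** 2 for a in x1) / n)
    std2 = math.sqrt(sum((b - mean2) ** 2 for b in x2) / n)
    if std1 == 0 or std2 == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (std1 * std2)


# Hypothetical data: averages of three age groups and assumed numeric codes
# for three marital-status variables.
group_representatives = [19.0, 22.0, 59.5]
status_codes = [1.0, 2.0, 3.0]
rho = pearson(group_representatives, status_codes)
```

Whether population or sample statistics are used does not change the ratio, because the same normalising factor appears in the numerator and the denominator.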

Optionally, the determining the correlation between the first attribute information and the second attribute information according to the correlation between any one set of variables and any one variable in the second attribute information includes: calculating the sum of the correlations between each set of variables and the variables in the second attribute information, and obtaining the number of sets in the plurality of sets of variables; and determining an average correlation according to the correlation sum and the number of sets, and taking the average correlation as the correlation between the first attribute information and the second attribute information.

For example, the first attribute information is age, the grouped age data includes (18, 20), (20, 24), and (29, 90), and the second attribute information is marital status. The similarity between (18, 20) and the marital status may be calculated to obtain a first correlation; the similarity between (20, 24) and the marital status may be calculated to obtain a second correlation; and the similarity between (29, 90) and the marital status may be calculated to obtain a third correlation. The cumulative sum of the first correlation, the second correlation, and the third correlation is then calculated; the average correlation is determined from the cumulative sum; and the average correlation is taken as the correlation between the first attribute information and the second attribute information.
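The averaging in this example can be sketched as follows. The three correlation values are invented for illustration, and the function name `average_correlation` is hypothetical.

```python
def average_correlation(group_correlations):
    """Divide the cumulative sum of per-group correlations by the group count."""
    if not group_correlations:
        raise ValueError("at least one group correlation is required")
    return sum(group_correlations) / len(group_correlations)


# Hypothetical first, second, and third correlations for the three age groups.
correlations = [0.2, 0.4, 0.6]
average = average_correlation(correlations)
```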

Optionally, the similarity values between each set of variables and the variables of the second attribute information typically fall within a certain range (i.e., they do not differ significantly). Therefore, to avoid calculation errors, the plurality of similarity values may be filtered. Specifically, the correlation between each set of variables and the second attribute information may be calculated to obtain a plurality of similarity values; the similarity values greater than a first similarity threshold and the similarity values smaller than a second similarity threshold are deleted from the plurality of similarity values to obtain valid similarity values; and the similarity value between the first attribute information and the second attribute information is calculated from the valid similarity values. The first similarity threshold is greater than the second similarity threshold.

Optionally, a correlation between each set of variables and the second attribute information may be calculated, to obtain a plurality of correlations; the minimum correlation or the maximum correlation among the plurality of correlations may be regarded as the correlation of the first attribute information and the second attribute information.
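The two optional variants above (threshold filtering, and taking the minimum or maximum) might be sketched as follows. The names `filter_similarities` and `combine`, the choice of the mean for the filtered values, and all numeric values are assumptions for illustration; the application does not fix how the valid values are combined.

```python
from statistics import mean


def filter_similarities(values, first_threshold, second_threshold):
    """Delete values above first_threshold or below second_threshold."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    return [v for v in values if second_threshold <= v <= first_threshold]


def combine(values, strategy="mean"):
    """Reduce per-group correlations to one value (mean, min, or max)."""
    strategies = {"mean": mean, "min": min, "max": max}
    if strategy not in strategies:
        raise ValueError("unknown strategy: " + strategy)
    if not values:
        raise ValueError("no correlation values to combine")
    return strategies[strategy](values)


# Hypothetical per-group similarity values and thresholds.
similarities = [0.1, 0.4, 0.5, 0.95]
valid = filter_similarities(similarities, first_threshold=0.9, second_threshold=0.2)
filtered_result = combine(valid)          # mean of the valid values
minimum_result = combine(similarities, "min")
maximum_result = combine(similarities, "max")
```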

Optionally, the determining the correlation between the first attribute information and the second attribute information using the plurality of sets of variables and the variables of the second attribute information includes: acquiring the level errors between the plurality of sets of variables and the variables of the second attribute information, and the number of variables in the second attribute information, where the number of variables in the second attribute information is the same as the number of sets in the plurality of sets of variables; raising the level error to a power to obtain a first value, and calculating the difference between the third power of the number of variables in the second attribute information and the number of variables in the second attribute information to obtain a second value; and determining the correlation between the first attribute information and the second attribute information according to the ratio between the first value and the second value. Determining the correlation from the level errors between the plurality of sets of variables and the variables of the second attribute information simplifies the calculation process, reduces the amount of calculation, and saves computing resources.

For example, the electronic device may calculate the correlation between the first attribute information and the second attribute information using the following formula (2).

In formula (2), ρ represents the correlation between the first attribute information and the second attribute information, n is the number of variables in the second attribute information, and d represents the level error between any one of the sets of variables and any one of the variables of the second attribute information.
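Based on the symbols defined above and on the first value (a power of the level error) and second value (n³ − n) described earlier, one form consistent with formula (2) is sketched below. The exponent 2, the factor 6, and the leading 1 are assumptions borrowed from the standard Spearman rank correlation coefficient, whose denominator matches the second value; they are not stated in the text.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n^{3} - n} \qquad (2)
```

Here d_i denotes the level error of the i-th pair, which assumes that each set of variables is paired with exactly one variable of the second attribute information.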

The level error between any one of the sets of variables and any one of the variables of the second attribute information may be obtained by: determining a first level of each of the plurality of sets of variables, determining a second level of each of the variables of the second attribute information, and determining the difference between the first level and the second level as the level error. For example, the levels corresponding to (18, 20), (20, 24), and (29, 90) are 1, 2, and 3, respectively, and the levels corresponding to the married state, the unmarried state, and the divorced state are 4, 5, and 6, respectively. Therefore, the level errors between (18, 20) and the variables of the second attribute information are -3, -4, and -5, respectively; the level errors between (20, 24) and the variables of the second attribute information are -2, -3, and -4, respectively; and the level errors between (29, 90) and the variables of the second attribute information are -1, -2, and -3, respectively.
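The level errors in this example can be reproduced with the short sketch below. The function names and the labels used for the marital states are illustrative. The `rank_correlation` helper assumes the Spearman-style form 1 − 6Σd²/(n³ − n) with one level error per pair, which is an assumption rather than something the text states.

```python
def level_errors(group_levels, variable_levels):
    """For each group, return its level minus the level of every variable."""
    return {
        group: [g_level - v_level for v_level in variable_levels.values()]
        for group, g_level in group_levels.items()
    }


def rank_correlation(paired_errors):
    """Assumed Spearman-style form: 1 - 6 * sum(d^2) / (n^3 - n)."""
    n = len(paired_errors)
    if n < 2:
        raise ValueError("at least two paired level errors are required")
    first_value = sum(d ** 2 for d in paired_errors)   # power of level errors
    second_value = n ** 3 - n                          # n cubed minus n
    return 1 - 6 * first_value / second_value


# Levels taken from the example above.
group_levels = {(18, 20): 1, (20, 24): 2, (29, 90): 3}
variable_levels = {"married": 4, "unmarried": 5, "divorced": 6}
errors = level_errors(group_levels, variable_levels)
```

For paired levels that both run from 1 to n, identical orderings give `rank_correlation([0, 0, 0]) == 1` and fully reversed orderings give `rank_correlation([-2, 0, 2]) == -1`.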

S205, if the correlation is smaller than a correlation threshold, the first attribute information and the second attribute information are used as attribute information with validity;

S206, if the correlation is greater than or equal to the correlation threshold, the first attribute information is used as attribute information with validity; or the second attribute information is used as attribute information with validity.

In step S205 and step S206, if the correlation is smaller than the correlation threshold, the similarity between the first attribute information and the second attribute information is relatively low, that is, the first attribute information and the second attribute information provide different information. Therefore, both the first attribute information and the second attribute information can be regarded as attribute information having validity, and both are useful for subsequent learning because together they provide more information. If the correlation is greater than or equal to the correlation threshold, the similarity between the first attribute information and the second attribute information is relatively high, that is, the information they provide is highly similar. Therefore, either the first attribute information or the second attribute information can be regarded as attribute information having validity, so that only one of the two is learned subsequently. This avoids repeated learning of the same variable, improves the learning efficiency of the attribute information, and saves resources.
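The screening rule of steps S205 and S206 can be sketched as follows. The function name `screen_attributes`, the `keep` parameter that chooses which attribute survives, and the sample threshold are assumptions for illustration.

```python
def screen_attributes(first, second, correlation, threshold, keep="first"):
    """Return the attribute information regarded as having validity.

    Below the threshold both attributes are kept (S205); at or above it only
    one of them is kept (S206), chosen by the hypothetical `keep` argument.
    """
    if keep not in ("first", "second"):
        raise ValueError("keep must be 'first' or 'second'")
    if correlation < threshold:
        return [first, second]
    return [first] if keep == "first" else [second]


# Hypothetical correlation values checked against an assumed threshold of 0.8.
low = screen_attributes("age", "marital_status", 0.2, 0.8)
high = screen_attributes("age", "marital_status", 0.9, 0.8)
high_second = screen_attributes("age", "marital_status", 0.9, 0.8, keep="second")
```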

Referring to fig. 3, which is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, the data processing apparatus in this embodiment includes: an acquisition module 301, an identification module 302, a grouping module 303, a determination module 304, and a screening module 305.

Optionally, the statistical feature of the first attribute information includes a variable change rate, the grouping module obtains the statistical feature of the first attribute information, performs grouping processing on the variables of the first attribute information according to the statistical feature, and obtains a plurality of groups of variables, including:

Optionally, the statistical feature of the first attribute information includes a variable difference, the grouping module obtains the statistical feature of the first attribute information, performs grouping processing on the variables of the first attribute information according to the statistical feature, and obtains a plurality of groups of variables, including:

Optionally, the grouping module performs grouping processing on the variables of the first attribute information by using an equidistant grouping policy to obtain multiple groups of variables, including:

Optionally, the grouping module performs grouping processing on the variables of the first attribute information by adopting an equal-frequency grouping policy to obtain multiple groups of variables, including:

Optionally, the determining module determines the correlation between the first attribute information and the second attribute information using the multiple sets of variables and the variables of the second attribute information, including:

Optionally, the screening module screens attribute information with validity from the first attribute information and the second attribute information based on the correlation determination, including:

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device in the embodiment shown in fig. 4 may include: one or more processors 21; one or more input devices 22, one or more output devices 23, and a memory 24. The processor 21, the input device 22, the output device 23, and the memory 24 are connected via a bus 25.

The processor 21 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The input device 22 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., the output device 23 may include a display (LCD, etc.), a speaker, etc., and the output device 23 may output the corrected data table.

The memory 24 may include read only memory and random access memory and provides instructions and data to the processor 21. A portion of the memory 24 may also comprise a non-volatile random access memory, the memory 24 being adapted to store a computer program comprising program instructions, the processor 21 being adapted to execute the program instructions stored by the memory 24 for performing a data processing method, i.e. for performing the following operations:

Optionally, the statistical feature of the first attribute information includes a variable change rate, the processor 21 is configured to execute program instructions stored in the memory 24, to perform obtaining the statistical feature of the first attribute information, and perform grouping processing on the variables of the first attribute information according to the statistical feature to obtain multiple groups of variables, where the processing includes:

Optionally, the statistical feature of the first attribute information includes a variable difference, the processor 21 is configured to execute program instructions stored in the memory 24, to perform obtaining the statistical feature of the first attribute information, and perform grouping processing on the variables of the first attribute information according to the statistical feature to obtain multiple groups of variables, where the processing includes:

Optionally, the processor 21 is configured to execute program instructions stored in the memory 24, and configured to perform grouping processing on the variables of the first attribute information using an equidistant grouping policy, to obtain multiple groups of variables, where the processing includes:

Optionally, the processor 21 is configured to execute program instructions stored in the memory 24, for executing determining a correlation between the first attribute information and the second attribute information using the plurality of sets of variables and the variables of the second attribute information, including:

Optionally, the processor 21 is configured to execute program instructions stored in the memory 24, and configured to perform screening of attribute information having validity from the first attribute information and the second attribute information based on the correlation determination, including:

Embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, implement the data processing method shown in the embodiments of fig. 1 and fig. 2.

The computer-readable storage medium may be an internal storage unit of the electronic device according to any of the foregoing embodiments, for example, a hard disk or a memory of the control device. The computer-readable storage medium may also be an external storage device of the control device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the control device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the control device. The computer-readable storage medium is used to store the computer program and other programs and data required by the control device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

As an example, the computer program stored in the computer-readable storage medium described above may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain network.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the control apparatus and unit described above may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated here.

In several embodiments provided by the present application, it should be understood that the disclosed control apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are illustrative, and for example, the division of the units may be a logic function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.

While the application has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. A method of data processing, comprising:

Acquiring an attribute information set of a sample user and characteristic attributes of attribute information in the attribute information set; the attribute information set comprises one or more of the name, age, sex, native place, education, income and work address of the sample user; the characteristic attribute of the attribute information in the attribute information set comprises at least one of the data type of the attribute information and the variable belonging to the attribute information;

Identifying first attribute information belonging to continuous variables and second attribute information belonging to discrete variables from the attribute information set according to characteristic attributes of the attribute information in the attribute information set;

Acquiring the statistical characteristics of the first attribute information, and grouping the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables;

Determining a first level of each of the plurality of sets of variables, determining a second level of each variable of the second attribute information;

Acquiring level errors between the multiple groups of variables and the variables of the second attribute information according to the difference between the first level and the second level, and acquiring the number of the variables in the second attribute information; the variable number of the second attribute information is the same as the group number of the plurality of groups of variables;

determining a correlation between the first attribute information and the second attribute information according to a ratio between the first value and the second value;

2. The method of claim 1, wherein the statistical feature of the first attribute information includes a variable change rate, the obtaining the statistical feature of the first attribute information, and the grouping the variables of the first attribute information according to the statistical feature to obtain a plurality of groups of variables includes:

3. The method of claim 1, wherein the statistical feature of the first attribute information includes a variable difference, the obtaining the statistical feature of the first attribute information, and grouping the variables of the first attribute information according to the statistical feature to obtain a plurality of groups of variables, includes:

4. A method according to claim 2 or 3, wherein grouping the variables of the first attribute information using an equidistant grouping strategy to obtain a plurality of groups of variables comprises:

5. The method according to claim 1, wherein the method further comprises:

and if the correlation is smaller than a correlation threshold, taking the first attribute information and the second attribute information as attribute information with validity.

6. A data processing apparatus, comprising:

The acquisition module is used for acquiring the attribute information set of the sample user and the characteristic attribute of the attribute information in the attribute information set; the attribute information set comprises one or more of the name, age, sex, native place, education, income and work address of the sample user; the characteristic attribute of the attribute information in the attribute information set comprises at least one of the data type of the attribute information and the variable belonging to the attribute information;

The identification module is used for identifying first attribute information belonging to continuous variables and second attribute information belonging to discrete variables from the attribute information set according to the characteristic attributes of the attribute information in the attribute information set;

The grouping module is used for acquiring the statistical characteristics of the first attribute information, and grouping the variables of the first attribute information according to the statistical characteristics to obtain a plurality of groups of variables;

A determining module, configured to determine a first level of each of the plurality of sets of variables, and determine a second level of each variable of the second attribute information; acquiring level errors between the multiple groups of variables and the variables of the second attribute information according to the difference between the first level and the second level, and acquiring the number of the variables in the second attribute information; the variable number of the second attribute information is the same as the group number of the plurality of groups of variables; the level error is subjected to power to obtain a first numerical value, and the difference between the third power of the variable quantity of the second attribute information and the variable quantity of the second attribute information is calculated to obtain a second numerical value; determining a correlation between the first attribute information and the second attribute information according to a ratio between the first value and the second value;

The screening module is used for taking the first attribute information as effective attribute information if the correlation is greater than or equal to a correlation threshold value; or the second attribute information is used as attribute information with validity.

7. An electronic device, comprising:

A processor adapted to implement one or more instructions; and

A computer readable storage medium storing one or more instructions adapted to be loaded by the processor and to perform the data processing method of any of claims 1-5.

8. A computer readable storage medium storing one or more instructions adapted to be loaded by a processor and to perform a data processing method according to any one of claims 1-5.