CN106919957B

CN106919957B - Method and device for processing data

Info

Publication number: CN106919957B
Application number: CN201710142483.2A
Authority: CN
Inventors: 徐骄
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2017-03-10
Filing date: 2017-03-10
Publication date: 2020-03-10
Anticipated expiration: 2037-03-10
Also published as: CN106919957A

Abstract

The embodiment of the invention discloses a method and a device for processing data. The method comprises the following steps: acquiring a data sample and attribute information of each attribute of the data sample; performing clustering calculation on the data samples according to the attribute information to determine filling values corresponding to various data samples, and updating the corresponding data samples according to the filling values; if the updated data sample meets the end condition, ending the operation; otherwise, clustering calculation is carried out on the updated data sample again until the updated data sample meets the end condition. By adopting the technical scheme, the filling values and the data samples containing the missing values corresponding to the filling values are determined through clustering calculation, so that the correctness of the filling values and the validity of data information can be improved, the processing speed of the missing values is improved, the time for processing the missing values is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.

Description

Method and device for processing data

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a method and an apparatus for processing data.

Background

In recent years, with the development of information processing technology, big data has been increasingly applied in fields such as navigation systems and city planning.

Current big data architectures generally process data along the data flow: data is first acquired from a data source and stored, then preprocessed; data modeling, data analysis and data mining are then performed on the preprocessed data, and finally data change is realized. Data preprocessing is therefore the basis of the whole data processing process in a big data architecture. Its quality and precision directly influence the index definition of data dimension modeling, the selection of a data mining algorithm, and the accuracy measurement of data in subsequent links, which makes it one of the important links of the data processing process.

In the prior art, missing values in data are generally handled by manual filling, by deleting records containing missing values (the deletion method), by filling with a special character (e.g., NULL), or by filling with a statistical mean or mode. However, when the data volume is large, manual filling consumes much time and effort and cannot meet the requirements of real-time, rapid transmission and processing of the data stream; deleting records containing missing values, filling with a uniform special character, or filling with a statistical mean or mode is not targeted and can reduce the accuracy and validity of the data. The prior art therefore cannot meet the requirements of high efficiency and high precision in missing value processing at the same time.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for processing data, so as to solve the technical problem that a data processing method in the prior art cannot simultaneously meet requirements of high efficiency and high precision of missing value processing.

In a first aspect, an embodiment of the present invention provides a method for processing data, including:

acquiring a data sample and attribute information of each attribute of the data sample, wherein the data sample comprises a data sample containing a missing value and a data sample not containing the missing value;

performing clustering calculation on the data samples according to the attribute information to determine filling values corresponding to various data samples, and updating the corresponding data samples according to the filling values;

if the updated data sample meets the end condition, ending the operation; otherwise, performing clustering calculation again on the updated data sample until the updated data sample meets an end condition, wherein the end condition comprises: the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, the cluster calculation times reach the preset calculation times, or the proportion of the data samples which do not contain the missing value in the updated data sample reaches the set threshold value.

In a second aspect, an embodiment of the present invention further provides an apparatus for processing data, including:

a sample information acquisition module, configured to acquire a data sample and attribute information of each attribute of the data sample, wherein the data sample comprises a data sample containing a missing value and a data sample not containing the missing value;

the filling value determining module is used for performing clustering calculation on the data samples according to the attribute information to determine filling values corresponding to various data samples and updating the corresponding data samples according to the filling values;

the cyclic calling module is used for ending the operation if the updated data sample meets the ending condition; otherwise, performing clustering calculation again on the updated data sample until the updated data sample meets an end condition, wherein the end condition comprises: the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, the cluster calculation times reach the preset calculation times, or the proportion of the data samples which do not contain the missing value in the updated data sample reaches the set threshold value.

According to the technical scheme for processing data, the data samples and the attribute information of the data samples are obtained, the data samples are subjected to clustering calculation according to the attribute information of the data samples to determine the filling values corresponding to the data samples, the corresponding data samples are updated according to the determined filling values, whether the updated data samples meet the end conditions or not is judged, and if yes, the operation is ended; and if not, clustering calculation is carried out on the updated data sample again until the updated data sample meets the end condition. By adopting the technical scheme, the filling values and the data samples containing the missing values corresponding to the filling values are determined through clustering calculation, so that the correctness of the filling values and the validity of data information can be improved, the processing speed of the missing values is improved, the time for processing the missing values is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:

Fig. 1 is a schematic flowchart of a method for processing data according to a first embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method for processing data according to a second embodiment of the present invention;

Fig. 3 is a schematic flowchart of a method for processing data according to a third embodiment of the present invention;

Fig. 4 is a block diagram of an apparatus for processing data according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings.

Example one

The embodiment of the invention provides a method for processing data. The method may be performed by an apparatus for processing data, wherein the apparatus may be implemented by hardware and/or software, and may generally be integrated in a data processing platform. Fig. 1 is a schematic flowchart of a method for processing data according to an embodiment of the present invention, as shown in fig. 1, the method includes:

S101, obtaining a data sample and attribute information of each attribute of the data sample, wherein the data sample comprises a data sample containing a missing value and a data sample not containing the missing value.

In this embodiment, the data sample may be an entity type data sample, and the data samples include first data samples and second data samples, where a first data sample is a data sample containing a missing value and a second data sample is a data sample containing no missing value. The attribute information of each attribute of the data sample may be information such as the attribute value of each attribute, the attribute value type of each attribute, and/or the contribution degree of each attribute. The specific definitions of the data samples containing missing values and the data samples not containing missing values may be set flexibly according to the processing manner. For example, a data sample lacking any one or more attribute values may be defined as a data sample containing a missing value, and correspondingly a data sample with no attribute value missing may be defined as a data sample not containing a missing value. Alternatively, when the missing values of a particular attribute are being processed, only a data sample whose value for that attribute is missing may be defined as a data sample containing a missing value, and correspondingly a data sample not containing a missing value may be defined either as a data sample whose value for that attribute is not missing or as a data sample with no attribute value missing. For simplicity of calculation, it is preferable, when processing the missing values of a particular attribute, to define a data sample whose value for that attribute is missing as a data sample containing a missing value, and a data sample with no attribute value missing as a data sample not containing a missing value.
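The preferred definition above can be illustrated with a minimal sketch. The representation is an assumption made only for illustration: each data sample is a dict from attribute name to value, `None` stands for a missing value, and `split_by_missing` is a hypothetical helper name.

```python
from typing import Optional

# Hypothetical representation: attribute name -> value, None means missing.
Sample = dict[str, Optional[float]]


def split_by_missing(samples: list[Sample], attribute: str):
    """Split samples according to the preferred definition in the text.

    to_fill:  samples whose value for `attribute` is missing.
    complete: samples with no missing value in any attribute.
    Samples that only miss some other attribute land in neither list.
    """
    to_fill = [s for s in samples if s.get(attribute) is None]
    complete = [s for s in samples if all(v is not None for v in s.values())]
    return to_fill, complete


samples = [
    {"age": 28, "salary": 4500},
    {"age": None, "salary": 4450},
    {"age": 31, "salary": None},
]
to_fill, complete = split_by_missing(samples, "age")
```

Here the third sample is neither filled nor used as a complete sample while the `age` attribute is being processed, which matches the preferred definition.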

In specific application, the data sample and the attribute information of the data sample can be pre-stored in a database corresponding to a data processing platform, and when the data sample and the attribute information of the data sample are obtained, the data sample is directly called from the storage position of the data sample, and the attribute information of the data sample is obtained from the storage position of the attribute information of the data sample; the data samples and attribute information thereof sent by other platforms or databases may also be obtained from the data transmission interface in real time, and missing values in the data samples are processed, which is not limited herein.

S102, performing clustering calculation on the data samples according to the attribute information to determine filling values corresponding to various data samples, and updating the corresponding data samples according to the filling values.

In the present embodiment, when processing missing values included in a data sample, the processing may be performed in the horizontal or vertical order, that is, in units of data samples, or in units of attributes. In view of convenience in processing data, optionally, the missing values in the data sample may be processed in attribute units, and the missing values in different attributes may be processed simultaneously or sequentially, that is, a processing order of the missing values in each attribute may be determined first, and then the missing values in each attribute may be processed sequentially according to the processing order; the missing values in the attributes may be processed simultaneously or in a random order.

For example, when a missing value in a certain attribute of a data sample is processed, clustering calculation may be performed on the data samples according to a set clustering algorithm; the filling value corresponding to each class of data samples after the clustering calculation is then determined, and the determined filling value is filled into the data samples to be filled of the corresponding class, thereby updating the data samples to be filled. The filling value corresponding to a class of data samples may be: the attribute value, at the cluster center of the class, of the attribute corresponding to the missing value; the attribute value of that attribute that occurs most frequently among the data samples of the class; or the attribute value of that attribute taken from the data sample not containing the missing value whose non-missing attribute values are most similar to those of the data sample containing the missing value. The clustering algorithm used when clustering the data samples can be set flexibly as needed; for example, K-Means, K-Medoids, CLARA or CLARANS may be adopted. The data samples to be filled in a class are data samples containing missing values. In application, all the data samples containing missing values in a class may be marked as data samples to be filled; alternatively, only those data samples containing missing values in the class that meet a preset condition may be marked as data samples to be filled, which is not limited herein.
In consideration of the accuracy of the data samples after filling, preferably, only the data samples containing missing values that meet the preset condition are marked as data samples to be filled, and only the data samples to be filled in each class are filled in each clustering calculation. For example, a data sample containing a missing value may be marked as a data sample to be filled if its distance from the cluster center of its class is less than a set distance threshold, its similarity to the cluster center is greater than a set similarity threshold, its distance rank falls within a first set proportion coefficient of the distance ranking of all data samples in the class, and/or its similarity rank falls within a second set proportion coefficient of the similarity ranking of all data samples in the class.
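As an illustrative sketch of one of the options described above, the following code takes the most frequent attribute value in a class as the filling value and fills only the samples that lie within a set distance threshold of the cluster center. The helper names, the dict representation with `None` for a missing value, and the distance values are assumptions made for this example.

```python
from collections import Counter


def mode_fill_value(cluster, attribute):
    """Most frequent non-missing value of `attribute` within one class.

    Raises IndexError if no sample in the class has the attribute set.
    """
    observed = [s[attribute] for s in cluster if s[attribute] is not None]
    return Counter(observed).most_common(1)[0][0]


def select_to_fill(cluster, distances, attribute, max_distance):
    """Samples missing `attribute` that lie within max_distance of the center."""
    return [
        s for s, d in zip(cluster, distances)
        if s[attribute] is None and d < max_distance
    ]


cluster = [
    {"city": "A", "grade": 1},
    {"city": "A", "grade": 2},
    {"city": "B", "grade": 1},
    {"city": None, "grade": 1},
    {"city": None, "grade": 9},
]
distances = [0.1, 0.2, 0.3, 0.15, 0.9]  # hypothetical distances to the center

fill = mode_fill_value(cluster, "city")
for sample in select_to_fill(cluster, distances, "city", max_distance=0.5):
    sample["city"] = fill
```

The last sample stays unfilled in this round because it is too far from the center; it would be reconsidered in a later clustering calculation.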

S103, if the updated data sample meets the end condition, ending the operation; otherwise, performing clustering calculation again on the updated data sample until the updated data sample meets an end condition, wherein the end condition comprises: the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, the cluster calculation times reach the preset calculation times, or the proportion of the data samples which do not contain the missing value in the updated data sample reaches the set threshold value.

Illustratively, when the ending condition is that the updated cluster center of the data sample is the same as the cluster center calculated by clustering last time, the cluster center calculated this time can be recorded and the cluster center calculated by clustering last time can be obtained after the cluster center calculated this time is determined each time, then whether the two cluster centers are the same or not is compared, and if the two cluster centers are the same, the operation is ended; if not, the cluster center of the current calculation is adopted to continue the cluster calculation. Here, it should be noted that the cluster centers of the two clustering calculations may be determined to be the same only when the cluster centers are completely the same, or may be determined to be the same when the similarity of the cluster centers is higher than a preset similarity threshold, for example, when the similarity of the cluster centers of the two clustering calculations reaches or is higher than 99%.
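The comparison of cluster centers can be sketched as follows. The text speaks of a similarity threshold without defining the similarity measure, so this example substitutes a Euclidean distance tolerance as an assumption; `centers_converged` and `tolerance` are hypothetical names, and a tolerance of 0 corresponds to requiring identical centers.

```python
import math


def centers_converged(previous, current, tolerance=0.0):
    """True when every center moved by at most `tolerance` (Euclidean)."""
    if len(previous) != len(current):
        return False
    return all(
        math.dist(p, c) <= tolerance for p, c in zip(previous, current)
    )


same = centers_converged([(1.0, 2.0)], [(1.0, 2.0)])
moved = centers_converged([(1.0, 2.0)], [(1.0, 2.5)])
close_enough = centers_converged([(1.0, 2.0)], [(1.0, 2.5)], tolerance=0.5)
```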

When the ending condition is that the clustering calculation times reach the preset calculation times, the clustering calculation times can be counted through software or hardware with a counting function such as a counter, the counting is increased by 1 every time the clustering calculation is carried out, and the operation is ended when the counting reaches the preset calculation times. Here, the preset number of times of clustering may be flexibly set as needed, for example, the preset number of times of clustering may be set to a value of 10000 times or 20000 times.

When the ending condition is that the proportion of data samples not containing missing values in the updated data samples reaches the set threshold, the missing values remaining in the data samples may be counted after each clustering calculation and the subsequent filling, and the operation is ended if the proportion of data samples not containing missing values reaches the set threshold. The set threshold may be set flexibly for different data samples. For example, if a specific application places a high requirement on the integrity of the data samples or of the attribute values of a certain attribute, the set threshold may be set to 100%; if the requirement on integrity is not high, the set threshold may be set to 99%, 95%, or another proportional value.
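The three end conditions can be combined into one check, sketched below under stated assumptions: samples are dicts with `None` for a missing value, centers are compared for exact equality, and `complete_ratio` and `should_stop` are hypothetical helper names.

```python
def complete_ratio(samples):
    """Fraction of samples with no missing (None) attribute value."""
    if not samples:
        return 1.0
    full = sum(all(v is not None for v in s.values()) for s in samples)
    return full / len(samples)


def should_stop(prev_centers, centers, iteration, max_iterations,
                samples, ratio_threshold):
    """Return True if any one of the three end conditions holds."""
    return (
        prev_centers == centers                      # centers unchanged
        or iteration >= max_iterations               # preset number of rounds
        or complete_ratio(samples) >= ratio_threshold  # enough complete samples
    )


samples = [{"a": 1}, {"a": None}, {"a": 3}, {"a": 4}]
keep_going = should_stop([(1,)], [(2,)], 3, 10000, samples, 0.95)
```

In this example three of four samples are complete (0.75), the centers differ, and only 3 of 10000 rounds have run, so the loop would continue.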

Optionally, the ending condition may be set to be that the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, so as to improve the practicability and integrity of the updated data sample.

The method for processing data, provided by the embodiment of the invention, includes the steps of obtaining data samples and attribute information of the data samples, performing clustering calculation on the data samples according to the attribute information of the data samples to determine filling values corresponding to the data samples, updating the corresponding data samples according to the determined filling values, judging whether the updated data samples meet end conditions, and if so, ending the operation; and if not, clustering calculation is carried out on the updated data sample again until the updated data sample meets the end condition. By adopting the technical scheme, the filling values and the data samples containing the missing values corresponding to the filling values are determined through clustering calculation, so that the correctness of the filling values and the validity of data information can be improved, the processing speed of the missing values is improved, the time for processing the missing values is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.
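The overall loop of this embodiment can be sketched as follows. This is a simplified illustration, not the claimed method itself; the assumptions are that each sample is a list whose last column is the attribute being filled (`None` when missing), that all other columns are complete and numeric, that the first `k` fully observed rows seed the centers, and that the filling value is the class mean of the observed values. Unlike the preferred variant, it fills every missing member of a class in one round.

```python
import math


def impute_by_clustering(samples, k, max_rounds=100):
    """Fill None in the last column using the mean of each sample's class.

    The input rows are modified in place and also returned.
    """
    features = [row[:-1] for row in samples]
    centers = [row[:-1] for row in samples if row[-1] is not None][:k]
    for _ in range(max_rounds):
        # Assign every sample to its nearest current center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(f, centers[j]))
            for f in features
        ]
        for j in range(len(centers)):
            members = [i for i, lab in enumerate(labels) if lab == j]
            observed = [samples[i][-1] for i in members
                        if samples[i][-1] is not None]
            if observed:
                fill = sum(observed) / len(observed)
                for i in members:
                    if samples[i][-1] is None:
                        samples[i][-1] = fill
        # Recompute centers as the attribute-wise mean of each class.
        new_centers = []
        for j, old in enumerate(centers):
            pts = [features[i] for i, lab in enumerate(labels) if lab == j]
            new_centers.append(
                [sum(col) / len(pts) for col in zip(*pts)] if pts else old
            )
        done = all(row[-1] is not None for row in samples)
        if new_centers == centers or done:
            break
        centers = new_centers
    return samples


data = [
    [1.0, 1.0, 10.0],
    [8.0, 8.0, 50.0],
    [1.2, 0.9, 12.0],
    [8.1, 7.9, None],
    [0.9, 1.1, None],
]
result = impute_by_clustering(data, k=2)
```

The loop ends when the centers stop changing, when `max_rounds` is reached, or when no missing value remains, mirroring the three end conditions of S103.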

Example two

Fig. 2 is a flowchart illustrating a method for processing data according to a second embodiment of the present invention. In this embodiment, optimizing on the basis of the foregoing embodiment, further, the performing cluster calculation on the data samples according to the attribute information to determine filling values corresponding to various types of data samples, and updating the corresponding data samples according to the filling values includes: determining the current clustering center according to a set rule; determining distance information between each data sample and each current clustering center according to the attribute information; classifying each data sample according to the distance information; determining a filling value according to an attribute value of an attribute corresponding to each missing value of the data samples in a target class, and updating the data samples which meet preset conditions and contain the missing values in the target class based on the filling value, wherein the target class is the class with a non-missing rate larger than a current non-missing rate threshold, and the non-missing rate is the proportion of the data samples which do not contain the missing values in the data samples.

Correspondingly, as shown in fig. 2, the method for processing data provided by this embodiment includes:

S201, acquiring a data sample and attribute information of each attribute of the data sample, wherein the data sample comprises a data sample containing a missing value and a data sample not containing the missing value.

S202, determining the current clustering center according to a set rule.

In this embodiment, the setting rule for determining the current clustering center may be flexibly set according to needs, for example, a preset number of data samples not containing the missing value may be randomly selected from the data samples as the current clustering center, or the data samples not containing the missing value may be randomly divided into a preset number of classes containing the same number or different numbers of data samples, and an average value of each class is taken as the current clustering center, or several data sample sets composed of the same number of data samples not containing the missing value are randomly generated, and an average value of each data sample set is taken as the current clustering center, and so on.
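The first rule mentioned above, randomly selecting a preset number of data samples not containing missing values, might look like the sketch below. Samples are assumed to be tuples with `None` for a missing value; `initial_centers` and the `seed` parameter are hypothetical names introduced for reproducibility.

```python
import random


def initial_centers(samples, k, seed=None):
    """Pick k complete samples (no None values) at random as initial centers."""
    complete = [s for s in samples if all(v is not None for v in s)]
    if len(complete) < k:
        raise ValueError("not enough complete samples to choose k centers")
    return random.Random(seed).sample(complete, k)


samples = [(1.0, 2.0), (None, 3.0), (4.0, 5.0), (6.0, 7.0)]
centers = initial_centers(samples, 2, seed=0)
```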

S203, determining distance information between each data sample and each current clustering center according to the attribute information.

In this embodiment, the distance information between each data sample and each current clustering center may be determined according to the Euclidean distance formula and/or the similarity between the attribute information of each data sample and each current clustering center, which is not limited herein. In a specific application, the distances between the data samples and the current clustering centers may be calculated simultaneously, in parallel; they may also be calculated taking the data sample or the clustering center as the unit. For example, the distance information between one data sample and each current clustering center may be calculated first, in a random or set order, then the distance information between another data sample and each current clustering center, and so on until the distance information between every data sample and every current clustering center has been calculated; or the distance information between one current clustering center and each data sample may be calculated first, in a random or set order, then the distance information between another current clustering center and each data sample, and so on until the distance information between every current clustering center and every data sample has been calculated.

In this embodiment, the distance information between each data sample and each current clustering center may be determined according to the attribute information of all attributes of each data sample, or according to the attribute information of the related attributes of each data sample, which is not limited here. In consideration of simplicity and practicability of calculation, optionally, the distance information between each data sample and each current clustering center may be determined according to the attribute information of the related attributes of each data sample. In this case, preferably, determining the distance information between each data sample and each current clustering center according to the attribute information specifically comprises: determining distance information between the data sample and each current clustering center according to the attribute value of each related attribute of the data sample; or determining distance information between the data sample and each current clustering center according to the attribute value of each related attribute of the data sample and the contribution degree of each related attribute; wherein a related attribute is an attribute related to the attribute corresponding to the missing value.
The related attributes of the attribute corresponding to a missing value (and the contribution degree of each related attribute) may be set flexibly by a developer or an operator as needed, or may be determined according to the association degree information between each attribute of the data sample and the attribute corresponding to the missing value; the association degree information may be obtained by counting the probability that the attribute values of other attributes (related attributes) change when the attribute value of the attribute corresponding to the missing value changes.

Illustratively, when distance information between a data sample and a current cluster center is determined according to the attribute values of the related attributes of the data sample: if the attribute values of the related attributes are all discrete values, the ratio of the number of related attributes whose values differ from those of the current cluster center to the total number of related attributes may be used as the distance information between the data sample and the current cluster center. For example, if the attribute corresponding to the missing value has 10 related attributes in total, and the attribute values of 8 of them in the data sample are the same as those of the current cluster center, the distance between the data sample and the current cluster center is M = (10-8)/10 = 0.2. If the attribute values of the related attributes are all continuous values, the distance information may be calculated by the Euclidean distance formula

$$M = \sqrt{(A_1 - A_2)^2 + (B_1 - B_2)^2 + (C_1 - C_2)^2 + (D_1 - D_2)^2}$$

or by the alternative formula of the original publication (its formula image is not reproduced in this text), where $A_1$, $B_1$, $C_1$ and $D_1$ are the attribute values of the data sample and $A_2$, $B_2$, $C_2$ and $D_2$ are the attribute values of the current cluster center. For example, if the attribute corresponding to the missing value has two related attributes, an age attribute and a payroll attribute, the age value of the data sample is 28 and its payroll value is 4500, and the age value of the current cluster center is 29 and its payroll value is 4450, the distance between the data sample and the current cluster center is

$$M = \sqrt{(28 - 29)^2 + (4500 - 4450)^2} = \sqrt{2501} \approx 50.01$$
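The two unweighted distance measures described above can be checked with a short sketch. The function names are hypothetical, and samples and centers are assumed to be equal-length sequences of related-attribute values.

```python
import math


def mismatch_distance(sample, center):
    """Share of related attributes whose values differ from the center's."""
    differing = sum(1 for a, b in zip(sample, center) if a != b)
    return differing / len(center)


def euclidean_distance(sample, center):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sample, center)))


# 10 related attributes, 8 of which match the center: (10 - 8) / 10
center = list("abcdefghij")
sample = list("abcdefghXY")
discrete_m = mismatch_distance(sample, center)

# Age 28 / payroll 4500 against a center at age 29 / payroll 4450
continuous_m = euclidean_distance((28, 4500), (29, 4450))
```

Note that in the continuous example the payroll difference dominates the result, which is one reason an implementation might normalize attributes or weight them by contribution degree before computing distances.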

When distance information between a data sample and a current cluster center is determined according to the attribute values of the related attributes of the data sample and the contribution degrees of the related attributes: if the attribute values of the related attributes are all discrete values, the ratio of the sum of the contribution degrees of the related attributes whose values differ from those of the current cluster center to the sum of the contribution degrees of all related attributes may be used as the distance information between the data sample and the current cluster center. For example, if the attribute corresponding to the missing value has 4 related attributes A, B, C and D in total (with contribution degrees of 0.8, 0.7, 0.9 and 0.85, respectively), and the attribute values of the related attributes A and B in the data sample are the same as those of the current cluster center, the distance between the data sample and the current cluster center is M = (0.9+0.85)/(0.8+0.7+0.9+0.85) ≈ 0.54. If the attribute values of the related attributes are all continuous values, the distance information may be calculated by a Euclidean distance formula weighted by the contribution degrees (the two formula images of the original publication are not reproduced in this text), where A1, B1, C1 and D1 are the attribute values of the data sample, A2, B2, C2 and D2 are the attribute values of the current clustering center, and a, b, c and d are the contribution degrees of the related attributes A, B, C and D. For example, if the attribute corresponding to the missing value has two related attributes, an age attribute and a payroll attribute, the age value of the data sample is 28 and its payroll value is 4500, the age value of the current clustering center is 29 and its payroll value is 4450, and the contribution degrees of the age attribute and the payroll attribute are 0.95 and 0.9 respectively, the distance between the data sample and the current clustering center is obtained by substituting these values into the weighted formula.
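The contribution-weighted measures can be sketched as below. The discrete case follows the ratio stated in the text. The continuous case is an ASSUMPTION: the weighted formula is not recoverable from this text, so the sketch uses one common form in which each squared difference is multiplied by its contribution degree; the original may instead scale the differences before squaring. All function names are hypothetical.

```python
import math


def weighted_mismatch_distance(sample, center, weights):
    """Contribution of differing attributes over the total contribution."""
    differing = sum(w for a, b, w in zip(sample, center, weights) if a != b)
    return differing / sum(weights)


def weighted_euclidean_distance(sample, center, weights):
    """ASSUMED form: each squared difference is scaled by its contribution."""
    return math.sqrt(
        sum(w * (a - b) ** 2 for a, b, w in zip(sample, center, weights))
    )


# Attributes A and B match the center; C and D differ.
m = weighted_mismatch_distance(
    ("x", "y", "p", "q"), ("x", "y", "r", "s"), (0.8, 0.7, 0.9, 0.85)
)

# Age/payroll example with contribution degrees 0.95 and 0.9.
d = weighted_euclidean_distance((28, 4500), (29, 4450), (0.95, 0.9))
```

Rounded to two decimals, `m` reproduces the 0.54 of the discrete example; the value of `d` depends on the assumed weighting form and should not be read as the figure from the original publication.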

Here, it should be noted that, if the related attributes of the attribute corresponding to the missing value include both discrete values and continuous values, the continuous values may be discretized first, and the distance information between each data sample and each current cluster center is then calculated.
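The text does not say how continuous values are discretized. One simple possibility, shown here purely as an assumption, is equal-width binning; `equal_width_bins` and `n_bins` are hypothetical names.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [22, 28, 29, 35, 41, 58]
age_bins = equal_width_bins(ages, 3)
```

After binning, a continuous attribute such as age can be compared with the same match/mismatch rule used for discrete attributes.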

S204, classifying the data samples according to the distance information.

For example, when a certain data sample is classified, the class to which the cluster center closest to the data sample belongs may be selected as the class of the data sample. If two or more cluster centers closest to the data sample exist, the data sample may not be classified in the current cluster calculation, or a class to which the cluster center closest to the data sample belongs may be randomly selected as the class to which the data sample belongs, which is not limited herein.
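The classification rule and its two tie-handling options can be sketched as follows. `assign_class` and `tie_policy` are hypothetical names; the input is assumed to be the list of distances from one sample to each current cluster center.

```python
import random


def assign_class(distances, tie_policy="skip", rng=random):
    """Return the index of the nearest center for one sample.

    tie_policy: "skip" returns None on a tie, leaving the sample
                unclassified in this round; "random" picks one of
                the tied centers.
    """
    nearest = min(distances)
    tied = [i for i, d in enumerate(distances) if d == nearest]
    if len(tied) == 1:
        return tied[0]
    return None if tie_policy == "skip" else rng.choice(tied)


unique = assign_class([0.4, 0.1, 0.3])
skipped = assign_class([0.2, 0.2, 0.5])
picked = assign_class([0.2, 0.2, 0.5], tie_policy="random")
```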

S205, determining a filling value according to attribute values of attributes corresponding to missing values of all data samples in a target class, and updating the data samples which meet preset conditions and contain the missing values in the target class based on the filling value, wherein the target class is the class with a non-missing rate larger than a current non-missing rate threshold, and the non-missing rate is the proportion of the data samples which do not contain the missing values in the data samples.

In this embodiment, the non-missing rate threshold and the preset condition may be set flexibly as required. For example, the non-missing rate threshold may be set to a proportional value such as 0.5 or 0.6, and the preset condition may be set as: the distance from the cluster center of the class is smaller than a set distance threshold, the similarity to the cluster center is greater than a set similarity threshold, the distance rank falls within a first set proportional coefficient of the distance ranking of all data samples of the class, and/or the similarity rank falls within a second set proportional coefficient of the similarity ranking of all data samples of the class. It should be noted that the non-missing rate thresholds used in different rounds of clustering calculation may be the same value or different values, which is not limited herein. For example, assuming that a class contains 1000 data samples, of which 200 are missing the attribute value of the attribute corresponding to the missing value, the non-missing rate of the class is (1000-200)/1000 = 0.8.
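The non-missing rate and the selection of target classes can be sketched as follows, assuming samples are dicts with `None` for a missing value; `non_missing_rate` and `target_classes` are hypothetical helper names.

```python
def non_missing_rate(cluster, attribute):
    """Share of samples in the class whose `attribute` value is present."""
    present = sum(1 for s in cluster if s.get(attribute) is not None)
    return present / len(cluster)


def target_classes(clusters, attribute, threshold):
    """Indices of non-empty classes whose non-missing rate exceeds threshold."""
    return [
        i for i, c in enumerate(clusters)
        if c and non_missing_rate(c, attribute) > threshold
    ]


# 1000 samples, 200 of them missing the attribute: (1000 - 200) / 1000
big = [{"age": 30}] * 800 + [{"age": None}] * 200
small = [{"age": None}, {"age": 41}]
rate = non_missing_rate(big, "age")
targets = target_classes([big, small], "age", threshold=0.6)
```

With a threshold of 0.6, only the first class (rate 0.8) is a target class; the second (rate 0.5) is left for a later round.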

S206, if the updated data sample meets the end condition, ending the operation; otherwise, returning to step S202 until the updated data sample meets an end condition, where the end condition includes: the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, the cluster calculation times reach the preset calculation times, or the proportion of the data samples which do not contain the missing value in the updated data sample reaches the set threshold value.
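The three end conditions can be combined into a single check, as in the sketch below. The function and parameter names are hypothetical, and treating the conditions as alternatives (any one suffices) follows the "or" in the step above.

```python
def should_stop(prev_centers, centers, iteration, max_iterations,
                complete_fraction, fraction_threshold):
    """Return True when any one of the end conditions holds.

    prev_centers / centers: cluster centers of the previous and current round.
    iteration: number of cluster calculations performed so far.
    complete_fraction: proportion of samples that contain no missing value.
    """
    centers_unchanged = prev_centers is not None and prev_centers == centers
    budget_spent = iteration >= max_iterations
    filled_enough = complete_fraction >= fraction_threshold
    return centers_unchanged or budget_spent or filled_enough
```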

The method for processing data provided by the second embodiment of the invention determines the current clustering center according to the set rule, determines the distance information between the data sample and each current clustering center according to the attribute information of the data sample, classifies each data sample according to the obtained distance information, determines a filling value according to the attribute value of the attribute corresponding to the missing value of each data sample in the target class whose non-missing rate is greater than the current non-missing rate threshold, and updates the data samples in the target class that contain the missing value and meet the preset condition based on the filling value. By adopting the technical scheme, each clustering calculation updates only part of the data samples containing missing values, and only those in classes that do not have a high missing rate, so that the accuracy of the filling values can be improved, the accuracy and the effectiveness of the filled data samples are improved, the processing speed of the missing values is improved, the time for processing the missing values is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.

EXAMPLE III

Fig. 3 is a flowchart illustrating a method for processing data according to a third embodiment of the present invention. The present embodiment is optimized on the basis of the foregoing embodiments. Further, the determining the current clustering center of the clustering calculation according to the set rule includes: judging whether the data samples have a class to which they belong; if not, acquiring at least two data samples from the data samples which do not contain the missing value as the current clustering centers; and if so, calculating the centroid point of each class as the current clustering center according to the attribute information of each attribute of the data samples contained in each class.

Further, the determining a filling value according to the attribute value of the attribute corresponding to the missing value of each data sample in the target class, and updating the filling of the missing value in the data sample containing the missing value, which meets the preset condition, in the target class based on the filling value includes: determining a target class capable of being filled with the missing value according to the non-missing rate of the attribute corresponding to the missing value in each class; determining data samples to be filled based on distance information between data samples in the target class each containing a missing value and a current cluster center of the target class; and determining a filling value according to the attribute value of the attribute corresponding to the missing value of the data sample not containing the missing value in the target class, and updating the data sample to be filled according to the filling value.

Further, after determining a filling value according to the attribute value of the attribute corresponding to the missing value of each data sample in the target class and updating the filling of the missing value in the data sample containing the missing value, which meets the preset condition, in the target class based on the filling value, the method further includes: updating the current non-miss rate threshold based on a set update rule.

Correspondingly, as shown in fig. 3, the method for processing data provided by this embodiment includes:

S301, acquiring data samples and attribute information of each attribute of the data samples, wherein the data samples comprise data samples containing missing values and data samples not containing the missing values.

S302, judging whether the data sample belongs to the class, if not, executing a step S303; if yes, go to step S304.

In this embodiment, the method for determining whether the data sample has the belonged class may be flexibly selected according to needs, for example, information of the class to which a certain data sample belongs may be marked in the sample information of the data sample when performing cluster calculation each time, and when determining whether the data sample has the belonged class, sample information of one or more data samples may be arbitrarily obtained, if the sample information of the one or more data samples has the information mark of the class to which the data sample belongs, it is determined that the data sample has the belonged class, otherwise, it is determined that the data sample does not have the belonged class; or storing the clustering center information of the data sample during each clustering calculation, and determining whether the data sample belongs to the class by determining whether the clustering center information corresponding to the data sample is stored in the data processing platform, wherein if the clustering center information corresponding to the data sample is stored in the data processing platform, it is determined that the data sample belongs to the class, and otherwise, it is determined that the data sample does not belong to the class.

S303, acquiring at least two data samples which do not contain the missing values from the data samples which do not contain the missing values as a current clustering center, and executing the step S305.

In this embodiment, the number of the obtained data samples that do not include the missing value may be flexibly set as needed, for example, 3 or 5 data samples that do not include the missing value may be obtained as the current clustering center. In consideration of simplicity and effectiveness of clustering calculation, the number of acquired data samples not containing missing values may be optionally set to 3 to 5. When the data samples are obtained, the data samples not containing the missing values can be obtained randomly or according to a preset rule to serve as the current clustering center, for example, if the attribute values of the attributes corresponding to the missing values are discrete numerical values, the number of the obtained data samples not containing the missing values can be determined according to the number of the attribute values of the attributes corresponding to the missing values; if the attribute value of the attribute corresponding to the missing value is a continuity value, 3-5 data samples not containing the missing value can be randomly acquired as the current clustering center.
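One possible way to pick the initial clustering centers is sketched below. It assumes samples are dictionaries with `None` marking a missing value. Using the number of distinct observed values as the center count for a discrete attribute is only one reading of the text, and the names `pick_initial_centers` and `k_continuous` are hypothetical.

```python
import random

def pick_initial_centers(samples, missing_attr, is_discrete,
                         rng=None, k_continuous=3):
    """Randomly choose complete samples to serve as initial cluster centers."""
    rng = rng or random.Random()
    # Only samples without any missing value may serve as centers.
    complete = [s for s in samples if all(v is not None for v in s.values())]
    if len(complete) < 2:
        raise ValueError("need at least two samples without missing values")
    if is_discrete:
        # Assumption: one center per distinct value of the attribute.
        k = len({s[missing_attr] for s in complete})
    else:
        k = k_continuous  # the text suggests 3 to 5
    k = max(2, min(k, len(complete)))
    return rng.sample(complete, k)
```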

S304, calculating the centroid point of each type as the current clustering center according to the attribute information of each attribute of the data sample contained in each type, and executing the step S305.

For example, the mean value of the attribute values of the attributes of the data samples in each class may be used as the current clustering center of the data samples. Correspondingly, when calculating the average value of a certain attribute of a data sample in a certain class, if the attribute value of the attribute is a continuous numerical value, the average value of the attribute values of all the data samples in the class or the average value of the attribute values of the data samples in which the attribute value of the attribute is not missing in the class data samples can be directly calculated as the average value of the attribute of the class data samples; if the attribute value of the attribute is a discrete value, the attribute value of the attribute with the largest occurrence number in the class of data sample may be taken as the average value of the attribute of the class of data sample, and if the attribute value of the attribute with the largest occurrence number is multiple, one of the multiple attribute values with the largest occurrence number may be randomly acquired as the average value of the attribute of the class of data sample.
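The centroid calculation described above, a mean for continuous attributes and the most frequent value for discrete attributes, can be sketched as follows. The sketch takes the variant that ignores samples whose value for the attribute is missing; the dictionary representation and the name `centroid` are assumptions.

```python
import random
from collections import Counter
from statistics import mean

def centroid(samples, discrete_attrs, rng=None):
    """Compute the centroid of one class, attribute by attribute."""
    rng = rng or random.Random()
    center = {}
    for attr in samples[0]:
        values = [s[attr] for s in samples if s[attr] is not None]
        if not values:
            center[attr] = None  # no observed value for this attribute
        elif attr in discrete_attrs:
            counts = Counter(values)
            top = max(counts.values())
            modes = [v for v, c in counts.items() if c == top]
            center[attr] = rng.choice(modes)  # random pick among tied modes
        else:
            center[attr] = mean(values)
    return center
```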

S305, determining distance information between each data sample and each current clustering center according to the attribute information.

And S306, classifying the data samples according to the distance information.

And S307, determining a target class capable of being filled with the missing value according to the non-missing rate of the attribute corresponding to the missing value in each class.

For example, a non-missing rate threshold for determining whether a certain class is a target class may be obtained first, then a non-missing rate of attributes corresponding to missing values in each class is calculated, and a class greater than or equal to the non-missing rate threshold is determined as a target class. The non-missing rate threshold may be flexibly set as needed, for example, assuming that the non-missing rate threshold is 0.5, at this time, the class with the non-missing rate greater than or equal to 0.5 may be determined as the target class.
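Selecting the target classes can be sketched as below, using the "greater than or equal to" comparison from this example. The mapping from class label to member samples and the name `target_classes` are illustrative assumptions.

```python
def target_classes(classes, attribute, threshold):
    """Return labels of classes whose non-missing rate meets the threshold.

    classes: dict mapping a class label to the list of its samples (dicts);
    a missing value is represented by None.
    """
    targets = []
    for label, members in classes.items():
        if not members:
            continue
        present = sum(1 for s in members if s.get(attribute) is not None)
        if present / len(members) >= threshold:
            targets.append(label)
    return targets
```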

S308, determining data samples to be filled based on the distance information between the data samples containing the missing values in the target class and the current clustering center of the target class.

For example, a scale factor for determining the samples to be filled may be preset, and the data samples containing missing values whose distance from the cluster center ranks within the scale factor are determined as the data samples to be filled. The scale factor may apply only to the data samples containing missing values, or to all the data samples in the class. For example, assuming that the scale factor is 10%, the data samples containing missing values whose distance ranks within the first 10% of the distance ranking of the data samples containing missing values may be determined as the data samples to be filled; alternatively, the data samples containing missing values whose distance ranks within the first 10% of the distance ranking of all data samples of the class may be determined as the data samples to be filled.
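Both readings of the scale factor can be sketched as follows. Distances are assumed to be precomputed, ranking is from nearest to farthest, and rounding the cutoff up with `math.ceil` is an assumption, since the text does not state how a fractional count is handled. The name `samples_to_fill` is hypothetical.

```python
import math

def samples_to_fill(distances, has_missing, ratio, among_all=False):
    """Return indices of samples to fill, nearest to the center first.

    distances[i]: distance of sample i to its cluster center.
    has_missing[i]: True if sample i contains a missing value.
    among_all: rank against all samples of the class instead of only
               the samples that contain missing values.
    """
    indices = range(len(distances))
    if among_all:
        order = sorted(indices, key=distances.__getitem__)
        cutoff = math.ceil(ratio * len(order))
        return [i for i in order[:cutoff] if has_missing[i]]
    order = sorted((i for i in indices if has_missing[i]),
                   key=distances.__getitem__)
    cutoff = math.ceil(ratio * len(order))
    return order[:cutoff]
```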

S309, determining a filling value according to the attribute value of the attribute corresponding to the missing value of the data sample not containing the missing value in the target class, and updating the data sample to be filled according to the filling value.

For example, if the attribute value of the attribute corresponding to the missing value is a continuous numerical value, the average of the values of that attribute over the data samples not containing the missing value in the target class may be taken as the filling value of the target class; if the attribute value of the attribute corresponding to the missing value is a discrete numerical value, the value of that attribute with the highest occurrence frequency among the data samples not containing the missing value in the target class may be determined as the filling value, and if several values of that attribute share the highest occurrence frequency in a certain target class, the missing values in that target class may by default be left unprocessed in the current cluster calculation.
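The filling-value rule can be sketched as below: a mean for continuous attributes, the unique most frequent value for discrete attributes, and no filling when the mode is tied. Samples are assumed to be dictionaries with `None` for a missing value; `fill_value` and `apply_fill` are hypothetical names.

```python
from collections import Counter
from statistics import mean

def fill_value(members, attribute, is_discrete):
    """Derive the filling value of a target class, or None if undecided."""
    observed = [s[attribute] for s in members if s.get(attribute) is not None]
    if not observed:
        return None
    if not is_discrete:
        return mean(observed)
    counts = Counter(observed)
    top = max(counts.values())
    modes = [v for v, c in counts.items() if c == top]
    # A tied mode leaves the class unprocessed in this round.
    return modes[0] if len(modes) == 1 else None

def apply_fill(to_fill, attribute, value):
    """Write the filling value into the selected samples; return the count."""
    if value is None:
        return 0
    for sample in to_fill:
        sample[attribute] = value
    return len(to_fill)
```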

And S310, updating the current non-missing rate threshold value based on the set updating rule.

In this embodiment, the set update rule may be set flexibly as needed. For example, a separate non-missing rate threshold may be set in advance for each clustering calculation, and the threshold corresponding to a given clustering calculation is retrieved as the current non-missing rate threshold when that calculation is performed; in this case the non-missing rate thresholds of the individual clustering calculations are independent of one another. As another example, the current non-missing rate threshold of the present calculation may be set to 90% of the non-missing rate threshold of the previous calculation; in this case, assuming that the non-missing rate threshold of the previous calculation is 0.5, the current non-missing rate threshold of the present calculation is: 0.5 × 90% = 0.45.
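The two update rules mentioned above, a preset per-round schedule and a multiplicative decay, can be sketched together. The function name, the `schedule` list, and the default `decay` of 0.9 (taken from the 90% example) are illustrative assumptions.

```python
def next_threshold(current, next_round, schedule=None, decay=0.9):
    """Return the non-missing rate threshold for the next cluster calculation.

    schedule: optional list of preset thresholds, indexed by round number;
              when given, each round's threshold is independent of the others.
    decay: otherwise, the new threshold is `decay` times the current one.
    """
    if schedule is not None:
        return schedule[next_round]
    return current * decay

print(next_threshold(0.5, 1))
```

Lowering the threshold round by round gradually admits classes with more missing values, which is what lets later rounds fill samples that earlier rounds skipped.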

S311, if the updated data sample meets the end condition, ending the operation; otherwise, returning to step S304 until the updated data sample meets an end condition, where the end condition includes: the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, the cluster calculation times reach the preset calculation times, or the proportion of the data samples which do not contain the missing value in the updated data sample reaches the set threshold value.

According to the method for processing data provided by the third embodiment of the invention, the target class is determined according to the non-missing rate of the attribute corresponding to the missing value of the data samples in each class, the data samples to be filled are determined according to the distance information between the data samples containing the missing value and the clustering center, and the current non-missing rate threshold is updated after each clustering calculation is completed, so that all the missing values in the data samples can be filled, the accuracy of the filling values is improved, the accuracy and the effectiveness of the filled data samples are further improved, the processing speed of the missing values is improved, the time required for processing the missing values is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are further improved.

Example four

The fourth embodiment of the invention provides a device for processing data. The apparatus may be implemented by hardware and/or software, may generally be integrated in a data processing platform, and may process data by performing a method of processing data. Fig. 4 is a block diagram illustrating a structure of an apparatus for processing data according to the fourth embodiment of the present invention, and as shown in fig. 4, the apparatus includes:

a sample information obtaining module 401, configured to obtain a data sample and attribute information of each attribute of the data sample, where the data sample includes a data sample that includes a missing value and a data sample that does not include the missing value;

a filling value determining module 402, configured to perform clustering calculation on the data samples according to the attribute information to determine filling values corresponding to various types of data samples, and update corresponding data samples according to the filling values;

a loop calling module 403, configured to end the operation if the updated data sample meets the end condition; otherwise, performing clustering calculation again on the updated data sample until the updated data sample meets an end condition, wherein the end condition comprises: the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, the cluster calculation times reach the preset calculation times, or the proportion of the data samples which do not contain the missing value in the updated data sample reaches the set threshold value.

In the apparatus for processing data according to the fourth embodiment of the present invention, the sample information obtaining module obtains the data samples and the attribute information of each data sample, the fill value determining module performs cluster calculation on the data samples according to the attribute information of each data sample to determine the fill value corresponding to each data sample, updates the corresponding data sample according to the determined fill value, and determines whether the updated data sample meets the end condition or not through the loop call module; and if not, clustering calculation is carried out on the updated data sample again until the updated data sample meets the end condition. By adopting the above technical scheme, the filling values and the data samples containing the missing values corresponding to the filling values are determined through clustering calculation, so that the correctness of the filling values and the validity of data information can be improved, the processing speed of the missing values is improved, the time required for processing the missing values is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.

Further, the filling value determining module 402 includes: the cluster center determining unit is used for determining the current cluster center according to a set rule; the distance information determining unit is used for determining the distance information between each data sample and each current clustering center according to the attribute information; the data sample classification unit is used for classifying each data sample according to the distance information; and the data sample updating unit is used for determining a filling value according to the attribute value of the attribute corresponding to the missing value of each data sample in the target class, and updating the data samples which meet the preset condition and contain the missing values in the target class based on the filling value, wherein the target class is the class of which the non-missing rate is greater than the current non-missing rate threshold, and the non-missing rate is the proportion of the data samples which do not contain the missing values in the data samples.

Further, the cluster center determining unit is specifically configured to: judge whether the data samples have a class to which they belong; if not, acquire at least two data samples from the data samples which do not contain the missing value as the current clustering centers; and if so, calculate the centroid point of each class as the current clustering center according to the attribute information of each attribute of the data samples contained in each class.

Further, the distance information determining unit is specifically configured to: determine distance information between the data sample and each current clustering center according to the attribute value of each relevant attribute of the data sample; or determine distance information between the data sample and each current clustering center according to the attribute value of each relevant attribute of the data sample and the contribution degree of each relevant attribute; wherein the relevant attribute is an attribute related to the attribute corresponding to the missing value.
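The two distance options can be sketched with a single function. This excerpt does not give a distance formula, so using the contribution degree as a per-attribute weight in a Euclidean distance is an assumption, as are the name `weighted_distance` and the dictionary representation of samples and centers.

```python
import math

def weighted_distance(sample, center, relevant_attrs, contribution=None):
    """Distance over the relevant attributes only.

    contribution: optional dict mapping attribute -> contribution degree,
                  used here as a weight (assumption); when omitted, every
                  relevant attribute counts equally.
    """
    total = 0.0
    for attr in relevant_attrs:
        weight = 1.0 if contribution is None else contribution[attr]
        total += weight * (sample[attr] - center[attr]) ** 2
    return math.sqrt(total)
```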

Further, the data sample update unit includes: the target class determining subunit is used for determining a target class which can be filled with the missing value according to the non-missing rate of the attribute corresponding to the missing value in each class; a to-be-filled sample determining subunit, configured to determine to-be-filled data samples based on distance information between data samples in the target class each including a missing value and a current clustering center of the target class; and the to-be-filled sample updating subunit is used for determining a filling value according to the attribute value of the attribute corresponding to the missing value of the data sample which does not contain the missing value in the target class, and updating the to-be-filled data sample according to the filling value.

Further, the data sample update unit may further include: a threshold updating unit, configured to update the current non-missing rate threshold based on a set update rule after the filling value has been determined according to the attribute value of the attribute corresponding to the missing value of each data sample in the target class and the data samples which meet the preset condition and contain the missing value in the target class have been updated based on the filling value.

The device for processing data provided by the embodiment can execute the method for processing data provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the method for processing data. For details of the technique not described in detail in this embodiment, reference may be made to the method for processing data provided in any embodiment of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of processing data, comprising:

determining the current clustering center according to a set rule;

determining distance information between each data sample and each current clustering center according to the attribute information;

classifying each data sample according to the distance information;

determining a target class capable of being filled with the missing value according to the non-missing rate of the attribute corresponding to the missing value in the classes, wherein the non-missing rate is greater than the current non-missing rate threshold value, and the non-missing rate is the proportion of data samples not containing the missing value in the data samples;

determining data samples to be filled based on distance information between data samples in the target class each containing a missing value and a current cluster center of the target class;

determining a filling value according to attribute values of attributes corresponding to missing values of data samples not containing the missing values in a target class, and updating the data samples to be filled according to the filling value;

2. The method of claim 1, wherein determining the current cluster center according to the set rule comprises:

judging whether the data sample belongs to the class or not;

if not, acquiring at least two data samples from the data samples which do not contain the missing value as the current clustering centers;

and if so, calculating the centroid point of each class as the current clustering center according to the attribute information of each attribute of the data samples contained in each class.

3. The method according to claim 1, wherein the determining of the distance information between each data sample and each current cluster center according to the attribute information includes:

determining distance information between the data sample and each current clustering center according to the attribute value of each relevant attribute of the data sample; or,

determining distance information between the data sample and each current clustering center according to the attribute value of each relevant attribute of the data sample and the contribution degree of each relevant attribute;

and the related attribute is the related attribute of the attribute corresponding to the missing value.

4. The method according to claim 1, further comprising, after said updating the data samples to be filled according to the filling value:

updating the current non-missing rate threshold based on a set update rule.

5. An apparatus for processing data, comprising:

the cyclic calling module is used for ending the operation if the updated data sample meets the ending condition; otherwise, performing clustering calculation again on the updated data sample until the updated data sample meets an end condition, wherein the end condition comprises: the cluster center of the updated data sample is the same as the cluster center of the last cluster calculation, the cluster calculation times reach the preset calculation times or the proportion of the data samples which do not contain the missing value in the updated data sample reaches the set threshold;

wherein the padding value determining module comprises:

the cluster center determining unit is used for determining the current cluster center according to a set rule;

the distance information determining unit is used for determining the distance information between each data sample and each current clustering center according to the attribute information;

the data sample classification unit is used for classifying each data sample according to the distance information;

the data sample updating unit is used for determining a filling value according to an attribute value of an attribute corresponding to each missing value of the data samples in the target class, and updating the data samples which meet preset conditions and contain the missing values in the target class based on the filling value, wherein the target class is a class of which the non-missing rate is greater than a current non-missing rate threshold, and the non-missing rate is the proportion of the data samples which do not contain the missing values in the data samples;

the data sample update unit includes:

the target class determining subunit is used for determining a target class which can be filled with the missing value according to the non-missing rate of the attribute corresponding to the missing value in each class;

a to-be-filled sample determining subunit, configured to determine to-be-filled data samples based on distance information between data samples in the target class each including a missing value and a current clustering center of the target class;

and the to-be-filled sample updating subunit is used for determining a filling value according to the attribute value of the attribute corresponding to the missing value of the data sample which does not contain the missing value in the target class, and updating the to-be-filled data sample according to the filling value.

6. The apparatus according to claim 5, wherein the distance information determining unit is specifically configured to: