CN106919706A - Data updating method and device

Info

Publication number: CN106919706A
Application number: CN201710142462.0A
Authority: CN
Inventors: 徐骄
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2017-03-10
Filing date: 2017-03-10
Publication date: 2017-07-04

Abstract

The embodiment of the invention discloses a method and a device for updating data. The method comprises the following steps: acquiring first occurrence rate information of missing attributes and non-missing attributes in a data sample; calculating second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information; and determining a padding value corresponding to the first data sample according to the second occurrence rate information, and updating the first data sample according to the padding value. By adopting the technical scheme, the filling value corresponding to the missing value in the data sample containing the missing value is determined according to the occurrence rate information of each attribute value corresponding to the missing attribute in the data sample containing the missing value, so that the correctness of the filling value and the effectiveness of data information can be improved, the processing speed of the missing value is improved, the time for processing the missing value is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.

Description

Data updating method and device

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a method and an apparatus for updating data.

Background

In recent years, with the development of information processing technology, big data has been increasingly applied in fields such as navigation systems and city planning.

Current big data architectures generally process data along the data flow: data is first acquired from a data source and stored, then preprocessed, then used for data modeling, data analysis and data mining, and finally data change is realized. Data preprocessing is therefore the foundation of the whole data processing pipeline in a big data architecture. Its quality and precision directly affect the definition of indices in dimensional modeling, the choice of data mining algorithm, and the accuracy of the data in subsequent stages, which makes it one of the most important steps of the pipeline.

In the prior art, missing values in data are generally handled by manual filling, by deleting the records that contain missing values (i.e., the deletion method), by filling with a special character (e.g., NULL), or by filling with a statistical mean or mode. However, when the data volume is large, manual filling consumes much time and effort and cannot meet the requirement of real-time, rapid transmission and processing of the data stream. Deleting records that contain missing values, filling with a uniform special character, or filling with a statistical mean or mode is not targeted at the individual record, which reduces the accuracy and validity of the data. The prior art therefore cannot meet the requirements of high efficiency and high precision in missing value processing at the same time.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for updating data, so as to solve the technical problem that a data processing method in the prior art cannot simultaneously meet requirements of high efficiency and high precision of missing value processing.

In a first aspect, an embodiment of the present invention provides a method for updating data, including:

acquiring first occurrence rate information of missing attributes and non-missing attributes in data samples, wherein the data samples comprise a first data sample containing a missing value and a second data sample not containing the missing value, and the missing attributes are attributes corresponding to the missing values in the first data sample;

calculating second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information, wherein the second occurrence rate information is the occurrence rate information of each attribute value corresponding to the missing attribute appearing in the first data sample;

and determining a padding value corresponding to the first data sample according to the second occurrence rate information, and updating the first data sample according to the padding value.

In a second aspect, an embodiment of the present invention further provides a device for updating data, including:

a first occurrence rate information obtaining module, configured to obtain first occurrence rate information of a missing attribute and a non-missing attribute in a data sample, where the data sample includes a first data sample including a missing value and a second data sample not including the missing value, and the missing attribute is an attribute corresponding to the missing value in the first data sample;

a second occurrence rate information calculating module, configured to calculate second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information, where the second occurrence rate information is occurrence rate information of each attribute value corresponding to the missing attribute appearing in the first data sample;

and the data sample updating module is used for determining a filling value corresponding to the first data sample according to the second occurrence rate information and updating the first data sample according to the filling value.

According to the technical scheme for updating data provided by the embodiment of the invention, first occurrence rate information of missing attributes and non-missing attributes in a data sample is obtained, second occurrence rate information of attribute values corresponding to the missing attributes appearing in the data sample containing the missing values is calculated according to the obtained first occurrence rate information, filling values corresponding to the missing values in the data sample containing the missing values are determined according to the second occurrence rate information, and the data sample containing the missing values is updated according to the filling values. By adopting the technical scheme, the filling value corresponding to the missing value in the data sample containing the missing value is determined according to the occurrence rate information of each attribute value corresponding to the missing attribute in the data sample containing the missing value, so that the correctness of the filling value and the effectiveness of data information can be improved, the processing speed of the missing value is improved, the time for processing the missing value is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:

Fig. 1 is a flowchart illustrating a method for updating data according to a first embodiment of the present invention;

Fig. 2 is a flowchart illustrating a data updating method according to a second embodiment of the present invention;

Fig. 3 is a block diagram of a data updating apparatus according to a third embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings.

Embodiment One

This embodiment of the invention provides a data updating method. The method may be performed by a data updating apparatus, which may be implemented in hardware and/or software and may generally be integrated in a data processing platform. Fig. 1 is a schematic flowchart of a method for updating data according to a first embodiment of the present invention. As shown in Fig. 1, the method includes:

S110, obtaining first occurrence rate information of missing attributes and non-missing attributes in data samples, wherein the data samples comprise first data samples containing missing values and second data samples not containing missing values, and the missing attributes are the attributes corresponding to the missing values in the first data samples.

In this embodiment, the missing attribute is an attribute corresponding to a missing value in a data sample containing the missing value, and correspondingly, the non-missing attribute is an attribute corresponding to a non-missing value in that data sample. The data sample may be an entity class data sample. The definitions of "data sample containing a missing value" and "data sample not containing a missing value" may be set flexibly according to the processing manner. For example, a data sample lacking any one or more attribute values may be defined as a data sample containing a missing value, and correspondingly a data sample in which no attribute value is missing may be defined as a data sample not containing a missing value. Alternatively, when the missing values of one particular attribute are being processed, only the data samples in which the value of that attribute is missing may be defined as data samples containing a missing value, and correspondingly the data samples not containing a missing value may be defined as the data samples in which the value of that attribute is not missing, or as the data samples in which no attribute value is missing.
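The first of these definitions can be illustrated with a minimal Python sketch. The dictionary representation, the use of `None` to mark a missing value, and the names `split_samples` and `missing_attributes` are assumptions made for illustration only; the specification does not prescribe any data layout.

```python
from typing import Dict, List, Optional, Tuple

# Assumed encoding: a sample maps attribute names to values, None marks a missing value.
Sample = Dict[str, Optional[str]]


def split_samples(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Return (first, second): samples with at least one missing value, and complete samples."""
    first = [s for s in samples if any(v is None for v in s.values())]
    second = [s for s in samples if all(v is not None for v in s.values())]
    return first, second


def missing_attributes(sample: Sample) -> List[str]:
    """Names of the attributes whose value is missing in this sample."""
    return [name for name, value in sample.items() if value is None]


samples = [
    {"gender": "male", "home": "yes", "marital": None},
    {"gender": "male", "home": "yes", "marital": "married"},
    {"gender": "female", "home": "no", "marital": "unmarried"},
]
first, second = split_samples(samples)
```

Under the attribute-by-attribute definition, the same idea applies with the test `s[attr] is None` in place of the check over all values.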

In this embodiment, the missing values in the data samples may be processed row by row or column by column, that is, in units of data samples or in units of attributes, which is not limited here. When missing values are processed in units of data samples, the data samples that have the same missing attribute and the same attribute values for the non-missing attributes may be placed in one group, and the missing values of all data samples in the group may be processed at the same time. The non-missing attributes may be all attributes other than the missing attribute, or may be the related attributes of the missing attribute.
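The grouping described above can be sketched as follows. The function name `group_first_samples` and the `None`-for-missing encoding are illustrative assumptions, not part of the specification.

```python
from collections import defaultdict


def group_first_samples(first_samples):
    """Group sample indices by (missing attributes, non-missing attribute values)."""
    groups = defaultdict(list)
    for index, sample in enumerate(first_samples):
        missing = tuple(sorted(k for k, v in sample.items() if v is None))
        present = tuple(sorted((k, v) for k, v in sample.items() if v is not None))
        groups[(missing, present)].append(index)
    return dict(groups)


first_samples = [
    {"gender": "male", "home": "yes", "marital": None},
    {"gender": "female", "home": "no", "marital": None},
    {"gender": "male", "home": "yes", "marital": None},
]
groups = group_first_samples(first_samples)
```

Each group then needs only one computation of the filling value, which is reused for every sample in the group.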

For simplicity of calculation, the non-missing attributes are preferably the related attributes of the missing attribute. Accordingly, when the missing values of a certain attribute are processed, a first data sample may be defined as a data sample in which the value of that attribute is missing, and a second data sample may be defined as a data sample in which neither the value of that attribute nor the value of any of its related attributes is missing. The related attributes of a missing attribute may be set flexibly by a developer or an operator as needed, or may be determined from correlation information between the other attributes of the data sample and the missing attribute. The correlation information may be obtained by counting the probability that the attribute values of the other attributes change when the attribute value of the missing attribute changes.

The first occurrence rate information of the missing attribute and the non-missing attribute may be the occurrence rate of each attribute value of the missing attribute in the data samples, the occurrence rate of each attribute value of the missing attribute in the second data samples, the occurrence rate of each attribute value of the non-missing attribute in the second data samples, or the conditional probability of each attribute value of the non-missing attribute given the attribute value of the missing attribute, which is not limited here. Considering the practicality of each kind of occurrence rate information and the simplicity of calculation, the first occurrence rate information preferably includes first sub-occurrence rate information and second sub-occurrence rate information. The first sub-occurrence rate information is the occurrence rate of each attribute value of the missing attribute in the second data samples. The second sub-occurrence rate information is the occurrence rate, in the second data samples, of each non-missing attribute value of the first data sample, conditioned on the attribute value of the missing attribute. Alternatively, the first occurrence rate information includes the first sub-occurrence rate information, the second sub-occurrence rate information, and weight value information corresponding to the attribute values of the non-missing attributes of the first data sample.

The weight value information corresponding to the attribute value of a non-missing attribute of the first data sample may be set flexibly by a developer or an operator as needed, or may be determined from the occurrence rate of that attribute value in the second data samples, which is not limited here. For simplicity of setting, the occurrence rate, in the second data samples, of the attribute value of the non-missing attribute of the first data sample is preferably used as the weight value information corresponding to that attribute value.

In this embodiment, the first occurrence rate information of the missing attribute and the non-missing attributes of a first data sample may be calculated in real time when the missing value in that first data sample is processed. Alternatively, the first occurrence rate information of all attributes of the data samples may be calculated in advance and stored in a database of the data processing platform, and then read directly from the database when the missing value in a first data sample is processed, which is not limited here. To avoid recalculating the first occurrence rate information of each attribute every time a different first data sample is processed, the first occurrence rate information of each attribute of the data samples is preferably calculated and stored in advance.
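A minimal sketch of the precomputation, assuming the second data samples are dictionaries and that occurrence rates are plain relative frequencies. The function name `occurrence_rates` and the in-memory dictionary `rate_cache`, which stands in for the database mentioned above, are hypothetical.

```python
from collections import Counter, defaultdict


def occurrence_rates(second_samples, missing_attr, related_attrs):
    """Relative frequencies of each missing-attribute value and of each
    related-attribute value conditioned on the missing-attribute value."""
    class_counts = Counter(s[missing_attr] for s in second_samples)
    total = sum(class_counts.values())
    prior = {y: n / total for y, n in class_counts.items()}

    cond_counts = defaultdict(Counter)
    for s in second_samples:
        for attr in related_attrs:
            cond_counts[s[missing_attr]][(attr, s[attr])] += 1
    conditional = {
        y: {key: n / class_counts[y] for key, n in counts.items()}
        for y, counts in cond_counts.items()
    }
    return prior, conditional


second_samples = [
    {"gender": "male", "home": "yes", "marital": "married"},
    {"gender": "male", "home": "no", "marital": "married"},
    {"gender": "female", "home": "yes", "marital": "married"},
    {"gender": "male", "home": "no", "marital": "unmarried"},
]

# Hypothetical cache: computed once per missing attribute, then reused for every first sample.
rate_cache = {"marital": occurrence_rates(second_samples, "marital", ["gender", "home"])}
prior, conditional = rate_cache["marital"]
```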

S120, calculating second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information, wherein the second occurrence rate information is the occurrence rate information of each attribute value corresponding to the missing attribute appearing in the first data sample.

In this embodiment, the second occurrence rate information of an attribute value corresponding to the missing attribute of the first data sample may be defined flexibly as required. For example, it may be the ratio of (a) the product of the conditional probability information of the non-missing attribute values of the first data sample, conditioned on that attribute value of the missing attribute, and the occurrence rate information of that attribute value in the second data samples, to (b) the occurrence rate information of the non-missing attribute values of the first data sample in the second data samples. That is, the second occurrence rate information of the attribute value corresponding to the missing value is

$$P(y_i \mid R) = \frac{P(y_i)\prod_j P(R_j \mid y_i)}{P(R)}$$

where $P(R_j \mid y_i)$ is the conditional probability information of the non-missing attribute value $R_j$ of the first data sample, conditioned on the attribute value $y_i$ of the missing attribute, and $P(R)$ is the occurrence rate information of the non-missing attribute values of the first data sample in the second data samples.

For a group of first data samples that share the same non-missing attribute values, $P(R)$ is the same for every attribute value of the missing attribute, so it does not affect which attribute value has the largest second occurrence rate. For simplicity of calculation, it is therefore preferable that the first occurrence rate information includes first sub-occurrence rate information of each attribute value corresponding to the missing attribute in the second data samples, and second sub-occurrence rate information of each non-missing attribute value of the first data sample in the second data samples, conditioned on the attribute value corresponding to the missing attribute. Calculating the second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information then includes calculating it according to the formula

$$P(y_i \mid R) = P(y_i)\prod_j P(R_j \mid y_i)$$

where $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ conditioned on the attribute value $y_i$, and $R_j$ is an attribute value of a non-missing attribute in the first data sample.

Taking the case where the non-missing attributes are the related attributes as an example, assume that the missing attribute is the marital status attribute, whose attribute values are unmarried, married and divorced (with occurrence rates of 0.3, 0.6 and 0.1 in the second data samples, respectively), and that the related attributes of the missing attribute are gender and whether a house has been bought. For a certain first data sample, the related attribute values are male and house bought. Among the second data samples whose marital status is unmarried, married and divorced, the occurrence rates of male are 0.4, 0.7 and 0.5, respectively, and the occurrence rates of house bought are 0.3, 0.7 and 0.5, respectively. Then, for this first data sample, the probability that the attribute value of the missing attribute is unmarried is P(unmarried | R) = P(unmarried) × P(male | unmarried) × P(house bought | unmarried) = 0.3 × 0.4 × 0.3 = 0.036. Similarly, the probability that it is married is P(married | R) = 0.6 × 0.7 × 0.7 = 0.294, and the probability that it is divorced is P(divorced | R) = 0.1 × 0.5 × 0.5 = 0.025.
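The marital status example can be reproduced with a short Python sketch of the simplified product formula. The rates are the ones given in the example; the function name `second_occurrence_rates` and the key `"buys_home"` are illustrative assumptions.

```python
import math

# First sub-occurrence rates: P(y_i) in the second data samples.
prior = {"unmarried": 0.3, "married": 0.6, "divorced": 0.1}

# Second sub-occurrence rates: P(R_j | y_i) in the second data samples.
conditional = {
    "unmarried": {"male": 0.4, "buys_home": 0.3},
    "married": {"male": 0.7, "buys_home": 0.7},
    "divorced": {"male": 0.5, "buys_home": 0.5},
}


def second_occurrence_rates(prior, conditional, observed):
    """P(y_i) multiplied by P(R_j | y_i) for every observed non-missing value R_j."""
    return {
        y: p_y * math.prod(conditional[y][r] for r in observed)
        for y, p_y in prior.items()
    }


scores = second_occurrence_rates(prior, conditional, ["male", "buys_home"])
```

The scores are not normalized, because the common divisor P(R) is dropped; they are only meant to be compared with each other.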

In this embodiment, the second occurrence rate information of an attribute value corresponding to the missing attribute of the first data sample may also take weights into account. In that case, it is the ratio of (a) the product of the conditional probability information of the non-missing attribute values of the first data sample conditioned on that attribute value of the missing attribute, the occurrence rate information of that attribute value in the second data samples, and the weight value information corresponding to the non-missing attribute values of the first data sample, to (b) the occurrence rate information of the non-missing attribute values of the first data sample in the second data samples. That is, the second occurrence rate information of the attribute value corresponding to the missing value is

$$P(y_i \mid R) = \frac{P(y_i)\prod_j W_j\,P(R_j \mid y_i)}{P(R)}$$

where $W_j$ is the weight value information corresponding to the non-missing attribute value $R_j$ of the first data sample.

For a group of first data samples that share the same non-missing attribute values, $P(R)$ is the same for every attribute value of the missing attribute. For simplicity of calculation, it is therefore preferable that the first occurrence rate information includes first sub-occurrence rate information of each attribute value corresponding to the missing attribute in the second data samples, second sub-occurrence rate information of each non-missing attribute value of the first data sample in the second data samples, conditioned on the attribute value corresponding to the missing attribute, and weight value information corresponding to the attribute values of the non-missing attributes of the first data sample. Calculating the second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information then includes calculating it according to the formula

$$P(y_i \mid R) = P(y_i)\prod_j W_j\,P(R_j \mid y_i)$$

where $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$, $W_j$ is the weight value information corresponding to the attribute value $R_j$, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ conditioned on the attribute value $y_i$, and $R_j$ is an attribute value of a non-missing attribute in the first data sample.

Taking the case where the non-missing attributes are the related attributes as an example, assume that the missing attribute is the marital status attribute, whose attribute values are unmarried, married and divorced (with occurrence rates of 0.3, 0.6 and 0.1 in the second data samples, respectively), and that the related attributes of the missing attribute are gender and whether a house has been bought. For a certain first data sample, the related attribute values are male and house bought (with weight value information of 0.6 and 0.7, respectively). Among the second data samples whose marital status is unmarried, married and divorced, the occurrence rates of male are 0.4, 0.7 and 0.5, respectively, and the occurrence rates of house bought are 0.3, 0.7 and 0.5, respectively. Then, for this first data sample, the probability that the attribute value of the missing attribute is unmarried is:

P(unmarried | R) = P(unmarried) × W(male) × P(male | unmarried) × W(house bought) × P(house bought | unmarried)

= 0.3 × 0.6 × 0.4 × 0.7 × 0.3 = 0.01512

Similarly, the probability that the attribute value of the missing attribute is married is P(married | R) = 0.6 × 0.6 × 0.7 × 0.7 × 0.7 = 0.12348, and the probability that it is divorced is P(divorced | R) = 0.1 × 0.6 × 0.5 × 0.7 × 0.5 = 0.0105.
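The weighted variant can be checked with the same kind of sketch, using the rates and weights stated in the example. The function name `weighted_second_occurrence_rates` and the key `"buys_home"` are illustrative assumptions.

```python
import math

prior = {"unmarried": 0.3, "married": 0.6, "divorced": 0.1}
conditional = {
    "unmarried": {"male": 0.4, "buys_home": 0.3},
    "married": {"male": 0.7, "buys_home": 0.7},
    "divorced": {"male": 0.5, "buys_home": 0.5},
}
# Weight value information W_j for the non-missing attribute values of the first sample.
weights = {"male": 0.6, "buys_home": 0.7}


def weighted_second_occurrence_rates(prior, conditional, weights, observed):
    """P(y_i) multiplied by W_j * P(R_j | y_i) for every observed non-missing value R_j."""
    return {
        y: p_y * math.prod(weights[r] * conditional[y][r] for r in observed)
        for y, p_y in prior.items()
    }


weighted = weighted_second_occurrence_rates(
    prior, conditional, weights, ["male", "buys_home"]
)
```

Because every attribute value of the missing attribute is multiplied by the same weights for a given first data sample, the weights rescale the scores without changing their order in this example.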

S130, determining a filling value corresponding to the first data sample according to the second occurrence rate information, and updating the first data sample according to the filling value.

In this embodiment, when the missing value in a first data sample, or in a group of first data samples with identical non-missing attribute values, is processed, the attribute value of the missing attribute with the largest second occurrence rate information may be selected as the filling value, and the filling value is written to the position of the missing attribute to update the first data sample or the group of first data samples. For example, suppose that the missing attribute of a certain first data sample is the marital status attribute, whose attribute values are unmarried, married and divorced, and that the second occurrence rate information of these attribute values is P(unmarried | R) = 0.036, P(married | R) = 0.294 and P(divorced | R) = 0.025. Since P(married | R) > P(unmarried | R) > P(divorced | R), the filling value for the marital status attribute of the first data sample is "married".
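Selecting and applying the filling value amounts to taking the attribute value with the largest second occurrence rate. A minimal sketch, assuming dictionary samples with `None` for the missing value; the name `fill_missing` is hypothetical.

```python
def fill_missing(sample, missing_attr, scores):
    """Return a copy of the sample with the missing attribute set to the
    attribute value that has the largest second occurrence rate."""
    fill_value = max(scores, key=scores.get)
    updated = dict(sample)
    updated[missing_attr] = fill_value
    return updated


scores = {"unmarried": 0.036, "married": 0.294, "divorced": 0.025}
sample = {"gender": "male", "home": "yes", "marital": None}
updated = fill_missing(sample, "marital", scores)
```

For a group of first data samples with identical non-missing attribute values, the same filling value can be written to every sample in the group.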

The first method for updating data provided in the embodiment of the present invention obtains first occurrence rate information of a missing attribute and a non-missing attribute in a data sample, calculates second occurrence rate information of each attribute value corresponding to the missing attribute appearing in the data sample including the missing value according to the obtained first occurrence rate information, determines a padding value corresponding to the missing value in the data sample including the missing value according to the second occurrence rate information, and updates the data sample including the missing value according to the padding value. By adopting the technical scheme, the filling value corresponding to the missing value in the data sample containing the missing value is determined according to the occurrence rate information of each attribute value corresponding to the missing attribute in the data sample containing the missing value, so that the correctness of the filling value and the effectiveness of data information can be improved, the processing speed of the missing value is improved, the time for processing the missing value is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.

On the basis of the foregoing embodiment, before obtaining the first occurrence rate information of the missing attribute and the non-missing attribute in the data samples, the method further includes: training on the second data samples to determine the first occurrence rate information of the missing attribute and the non-missing attribute in the data samples. In this embodiment, the first occurrence rate information may be calculated directly from all of the second data samples. Alternatively, the second data samples may be divided into training samples and test samples, the first occurrence rate information may be obtained by training on the training samples, and the test samples may be used to test the accuracy of the first occurrence rate information obtained by training, which is not limited here. Preferably, the second data samples are divided into training samples and test samples, the first occurrence rate information is obtained from the training samples and tested with the test samples, and training is repeated on the second data samples when the accuracy does not meet the set condition. This ensures the accuracy of the first occurrence rate information and thus further improves the accuracy of the filling value corresponding to the missing attribute.

Embodiment Two

Fig. 2 is a flowchart illustrating a data updating method according to a second embodiment of the present invention. The embodiment is optimized on the basis of the foregoing embodiment, and further, the training the second data sample to determine the first occurrence rate information of the missing attribute and the non-missing attribute in the data sample includes: dividing the second data sample into a training sample set and a testing sample set according to a set proportion; training the training sample set to determine current occurrence rate information of missing attributes and non-missing attributes in the data samples; testing the current occurrence rate information by adopting the test sample set to generate a test result; if the test result meets the set accuracy threshold, ending the training operation; otherwise, the training sample set and the test sample set are divided again, and the newly divided training sample set is trained until the test result meets the set accuracy threshold; and marking the current occurrence rate information when the training is finished as first occurrence rate information of the missing attribute and the non-missing attribute in the data sample.
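The split, train, test and re-split loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: training is taken to mean counting relative frequencies, testing is taken to mean hiding the target attribute of each test sample and checking whether the highest-scoring value matches it, and `max_rounds` is an added safeguard so that the loop terminates. None of the names (`train`, `predict`, `train_until_accurate`) come from the specification.

```python
import random
from collections import Counter, defaultdict


def train(samples, target, features):
    """Occurrence rates of target values and of feature values per target value."""
    counts = Counter(s[target] for s in samples)
    prior = {y: n / len(samples) for y, n in counts.items()}
    joint = defaultdict(Counter)
    for s in samples:
        for f in features:
            joint[s[target]][(f, s[f])] += 1
    cond = {y: {k: n / counts[y] for k, n in c.items()} for y, c in joint.items()}
    return prior, cond


def predict(model, sample, features):
    """Target value with the largest product P(y) * prod_j P(R_j | y)."""
    prior, cond = model

    def score(y):
        p = prior[y]
        for f in features:
            p *= cond[y].get((f, sample[f]), 0.0)  # unseen value: rate 0
        return p

    return max(prior, key=score)


def train_until_accurate(second, target, features, ratio=0.5,
                         threshold=0.9, max_rounds=20, seed=0):
    """Re-split and retrain until the test accuracy reaches the threshold.
    Expects at least two second data samples."""
    rng = random.Random(seed)
    for _ in range(max_rounds):
        shuffled = list(second)
        rng.shuffle(shuffled)
        cut = max(1, min(len(shuffled) - 1, round(len(shuffled) * ratio)))
        train_set, test_set = shuffled[:cut], shuffled[cut:]
        model = train(train_set, target, features)
        hits = sum(predict(model, s, features) == s[target] for s in test_set)
        if hits / len(test_set) >= threshold:
            return model  # current occurrence rates become the first occurrence rates
    raise RuntimeError("accuracy threshold not reached within max_rounds")


second = ([{"gender": "male", "marital": "married"}] * 10
          + [{"gender": "female", "marital": "unmarried"}] * 10)
prior, cond = train_until_accurate(second, "marital", ["gender"])
```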

Correspondingly, as shown in Fig. 2, the method for updating data provided by this embodiment includes:

S210, dividing the second data samples into a training sample set and a test sample set according to a set proportion.

In this embodiment, the set proportion used to divide the training sample set and the test sample set may be chosen flexibly according to the number of non-missing attributes or the number of first data samples. For example, the proportion may be set to 1:1, that is, 50% of the second data samples are added to the training sample set as training samples and the remaining 50% are added to the test sample set as test samples. It should be noted that, when the number of data samples is small or the first data samples make up a large share of the data samples, the share of training samples among the second data samples may be increased appropriately. For example, the ratio of training samples to test samples may be adjusted to 6:4, 7:3 or 8:2, which is not limited here.

When the second data samples are divided into a training sample set and a test sample set, the set proportion of second data samples may be selected at random as training samples and added to the training sample set, with the remaining second data samples added to the test sample set as test samples. Alternatively, the set proportion of second data samples may be selected in a certain order, for example from front to back, from back to front, or by taking one or more samples at every interval, and added to the training sample set as training samples, with the remaining second data samples added to the test sample set as test samples.
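The selection strategies mentioned above might look like this in Python. The four function names are hypothetical, and the interval variant takes one sample at every `step` positions, which is only one possible reading of "one or more at every interval".

```python
import random


def split_random(samples, ratio, seed=None):
    """Pick round(len * ratio) samples at random for training, keep original order."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    chosen = set(indices[:round(len(samples) * ratio)])
    train = [s for i, s in enumerate(samples) if i in chosen]
    test = [s for i, s in enumerate(samples) if i not in chosen]
    return train, test


def split_front(samples, ratio):
    """Take training samples from front to back."""
    cut = round(len(samples) * ratio)
    return samples[:cut], samples[cut:]


def split_back(samples, ratio):
    """Take training samples from back to front."""
    cut = len(samples) - round(len(samples) * ratio)
    return samples[cut:], samples[:cut]


def split_interval(samples, step):
    """Take one training sample at every `step` positions."""
    train = samples[::step]
    test = [s for i, s in enumerate(samples) if i % step != 0]
    return train, test


samples = list(range(10))  # stand-ins for second data samples
train_r, test_r = split_random(samples, 0.5, seed=1)
```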

S220, training the training sample set to determine the current occurrence rate information of the missing attributes and the non-missing attributes in the data samples.

In this embodiment, the first occurrence rate information of the missing attribute and the non-missing attributes may be determined according to the proportion in which each attribute value corresponding to the missing attribute appears in the training samples, the proportion in which each non-missing attribute value appears when the missing attribute takes different attribute values, and/or the proportion in which each non-missing attribute value appears in the training samples. If the attribute values corresponding to the missing attribute are continuous numerical values, the attribute values of the missing attribute are first discretized, for example, the age attribute is discretized into intervals such as 20-25, 25-30, 30-35 and 35-40, and the proportion of each discretized attribute value of the missing attribute is then calculated. If the attribute values corresponding to a non-missing attribute are continuous values, the mean $\eta$ and the variance $\sigma^2$ of the non-missing attribute can be obtained from the training samples, giving the normal distribution formula of the non-missing attribute $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\eta)^2}{2\sigma^2}}$; the proportion in which the non-missing attribute values appear in the training samples, and the proportion in which they appear when the missing attribute takes different attribute values, are then calculated according to the obtained normal distribution formula.
Taking the payroll attribute as the non-missing attribute as an example, when calculating the proportion of each non-missing attribute value given that the missing attribute takes the attribute value $y_i$, the mean $\eta_{y_i}$ and the standard deviation $\sigma_{y_i}$ of the payroll attribute over the training samples whose missing attribute takes the value $y_i$ can be calculated first, and the proportion of each non-missing attribute value given $y_i$ is then calculated according to the formula $P(R_k \mid y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{y_i}} e^{-\frac{(R_k-\eta_{y_i})^2}{2\sigma_{y_i}^2}}$. Taking the case in which the weight value information is the occurrence rate information of each non-missing attribute value in the second data samples as an example, when calculating the weight value information corresponding to each non-missing attribute value, the mean $\eta$ and the standard deviation $\sigma$ of the payroll attribute over all training samples in the training sample set can be calculated first, the occurrence rate of the non-missing attribute value $R_k$ in the training samples is then calculated according to the formula $P(R_k) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(R_k-\eta)^2}{2\sigma^2}}$, and the calculated occurrence rate information is taken as the weight value information corresponding to the non-missing attribute value $R_k$.
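The treatment of continuous attributes described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names (`discretize`, `gaussian_density`, `fit_conditional_gaussians`), the dictionary-based sample layout and the left-closed interval convention are hypothetical, and the population standard deviation is used because the source does not say which estimator is intended.

```python
import math
from statistics import mean, pstdev


def discretize(value, edges):
    """Map a continuous value to an interval label such as "25-30".

    Intervals are assumed to be left-closed, [lo, hi); values outside
    all intervals return None.
    """
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"{lo}-{hi}"
    return None


def gaussian_density(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma (sigma > 0)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def fit_conditional_gaussians(samples, missing_attr, continuous_attr):
    """Return {y_i: (mean, std)} of continuous_attr for each value y_i of missing_attr."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample[missing_attr], []).append(sample[continuous_attr])
    # Population standard deviation is an assumption of this sketch.
    return {y: (mean(values), pstdev(values)) for y, values in groups.items()}


training = [
    {"marital_status": "married", "payroll": 5000.0},
    {"marital_status": "married", "payroll": 7000.0},
    {"marital_status": "unmarried", "payroll": 3000.0},
    {"marital_status": "unmarried", "payroll": 5000.0},
]
params = fit_conditional_gaussians(training, "marital_status", "payroll")
mu, sigma = params["married"]
print(discretize(27, [20, 25, 30, 35, 40]), gaussian_density(6500.0, mu, sigma))
```

A group whose values are all identical has a standard deviation of 0, for which the density is undefined; a real implementation would need to handle that case.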

Illustratively, when calculating the first occurrence rate information of the missing attribute and the non-missing attribute, assume that the missing attribute is marital status (with attribute values unmarried, married and divorced) and the non-missing attribute is gender (with attribute values male and female). Of 5000 training samples, 2650 have the gender male and 2350 have the gender female; 2000 are unmarried (1100 male and 900 female), 2700 are married (1400 male and 1300 female), and 300 are divorced (150 male and 150 female). The proportions in which the attribute values corresponding to the missing attribute appear in the training samples are then P(unmarried) = 2000/5000 = 0.4, P(married) = 2700/5000 = 0.54 and P(divorced) = 300/5000 = 0.06. In the training samples whose missing attribute takes the value unmarried, the proportions in which "male" and "female" appear are P(male | unmarried) = 1100/2000 = 0.55 and P(female | unmarried) = 900/2000 = 0.45; similarly, in the training samples whose missing attribute takes the value married, P(male | married) = 1400/2700 ≈ 0.52 and P(female | married) = 1300/2700 ≈ 0.48; and in the training samples whose missing attribute takes the value divorced, P(male | divorced) = P(female | divorced) = 150/300 = 0.5.
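The proportions in this worked example can be reproduced with a few lines of code. The sketch below only restates the counts given above; the function name `occurrence_rates` and the nested-dictionary layout are illustrative assumptions.

```python
# Counts from the worked example: marital status -> gender -> number of samples.
counts = {
    "unmarried": {"male": 1100, "female": 900},
    "married": {"male": 1400, "female": 1300},
    "divorced": {"male": 150, "female": 150},
}


def occurrence_rates(counts):
    """Return (prior, conditional): P(y_i) and P(R_k | y_i) from raw counts."""
    total = sum(sum(by_value.values()) for by_value in counts.values())
    prior = {}
    conditional = {}
    for y, by_value in counts.items():
        group_size = sum(by_value.values())
        prior[y] = group_size / total
        conditional[y] = {r: n / group_size for r, n in by_value.items()}
    return prior, conditional


prior, conditional = occurrence_rates(counts)
for y in counts:
    print(y, round(prior[y], 2), {r: round(p, 2) for r, p in conditional[y].items()})
```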

It should be noted that, when calculating the proportion of each non-missing attribute value when the missing attribute in the training sample set takes different attribute values, if the proportion $P(R_k \mid y_i)$ in which a certain non-missing attribute value $R_k$ appears under the condition of a certain attribute value $y_i$ corresponding to the missing attribute is 0, that is, the non-missing attribute value $R_k$ never occurs together with the attribute value $y_i$, the proportion needs to be calibrated to prevent the accuracy of the data samples from being degraded. The rule used for calibration may be flexibly set as needed. For example, Laplace calibration may be used: when $P(R_k \mid y_i) = 0$, the numerator of the ratio is increased by a set value, the denominator is increased by the product of the set value and the number of non-missing attributes, and the resulting ratio is used as the proportion in which the non-missing attribute value $R_k$ appears under the condition of the attribute value $y_i$ corresponding to the missing attribute. The set value may be 1, 2, 5 or another value as necessary.
For example, suppose the set value is 1, the missing attribute is the marital status attribute, and there are 5 non-missing attributes in total, one of which is the gender attribute. If the training samples contain 40 samples whose marital status is married, and the gender attribute value of all 40 samples is male, the proportion of females among the training samples whose marital status is married is $P(\text{female} \mid \text{married}) = \frac{0}{40} = 0$. In this case, the numerator of the ratio may be increased by 1 and the denominator increased by 5 (the set value 1 multiplied by the 5 non-missing attributes), so that the calibrated proportion of females among the training samples whose marital status is married is $P(\text{female} \mid \text{married}) = \frac{0+1}{40+5} = \frac{1}{45} \approx 0.022$.
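The calibration rule described above can be sketched as follows. The function name `calibrated_ratio` and its parameters are hypothetical; the sketch follows the rule as stated in the text, applying the adjustment only when the raw ratio is 0 and enlarging the denominator by the set value multiplied by the number of non-missing attributes.

```python
def calibrated_ratio(count, group_size, num_non_missing_attrs, set_value=1):
    """Proportion of a non-missing attribute value within one group of samples.

    count: samples in the group that carry the non-missing attribute value.
    group_size: samples whose missing attribute takes the given value y_i.
    The ratio is adjusted only when count is 0, as described in the text.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if count > 0:
        return count / group_size
    return (count + set_value) / (group_size + set_value * num_non_missing_attrs)


# 40 married training samples, none female, 5 non-missing attributes, set value 1.
print(calibrated_ratio(0, 40, 5))   # (0 + 1) / (40 + 5) = 1/45
print(calibrated_ratio(40, 40, 5))  # non-zero ratios are left unchanged
```

For comparison, the textbook form of Laplace smoothing adjusts every count, not only zero counts, and enlarges the denominator by the number of distinct values the attribute can take; the sketch deliberately keeps the variant stated in this description.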

S230, testing the current occurrence rate information with the test sample set to generate a test result.

Illustratively, when a test sample is used to test the current occurrence rate information with a certain attribute taken as the missing attribute, the second occurrence rate information of each attribute value corresponding to the missing attribute of the test sample can be calculated according to the current occurrence rate information obtained by training the training samples, and the attribute value corresponding to the missing attribute of the test sample is determined according to the obtained second occurrence rate information. It is then judged whether the attribute value determined according to the second occurrence rate information is the same as the real attribute value corresponding to the missing attribute in the test sample: if so, the current occurrence rate information is judged to be accurate for the test sample; otherwise, it is judged to be inaccurate for the test sample. This is repeated until all test samples in the test sample set have been tested and the test result is generated. It should be noted that the test result may be the record of whether the filling value of each test sample is accurate, or the accuracy rate of the filling values determined from whether the filling value of each test sample is accurate, which is not limited herein. For simplicity of the subsequent calculation, the test result is preferably the accuracy rate of the filling values.
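The test procedure above can be sketched as a small accuracy computation. The function name `evaluate_accuracy` and the `predict` callable are hypothetical: `predict` stands in for whatever routine derives a filling value from the current occurrence rate information, and is passed in so the sketch stays independent of how that value is computed.

```python
def evaluate_accuracy(test_samples, missing_attr, predict):
    """Fraction of test samples whose predicted value equals the real value.

    Each test sample is complete, so the attribute treated as missing is
    hidden from `predict` and then compared with the prediction.
    """
    if not test_samples:
        raise ValueError("test_samples must not be empty")
    correct = 0
    for sample in test_samples:
        known = {k: v for k, v in sample.items() if k != missing_attr}
        if predict(known) == sample[missing_attr]:
            correct += 1
    return correct / len(test_samples)


test_set = [
    {"gender": "male", "marital_status": "married"},
    {"gender": "female", "marital_status": "unmarried"},
    {"gender": "male", "marital_status": "unmarried"},
    {"gender": "female", "marital_status": "unmarried"},
]


def toy_predict(known):
    # Toy rule used only to exercise the function.
    return "married" if known["gender"] == "male" else "unmarried"


print(evaluate_accuracy(test_set, "marital_status", toy_predict))
```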

S240, if the test result meets a set accuracy threshold, ending the training operation; otherwise, the training sample set and the test sample set are divided again, and the newly divided training sample set is trained until the test result meets the set accuracy threshold.

In this embodiment, the set accuracy threshold may be flexibly set as needed, for example to 95%, 98% or 100%. Illustratively, assuming that the accuracy threshold is 98%, the training operation is ended if the test result is greater than or equal to 98%; if the test result is less than 98%, the training sample set and the test sample set are divided again, and the newly divided training sample set is trained to determine the current occurrence rate information again. It should be noted that, when the training sample set and the test sample set are re-divided, the set proportion used last time may be reused; alternatively, the set proportion may be reset and the training sample set and the test sample set re-divided according to the reset proportion, which is not limited herein.
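The repeat-until-accurate loop of S210 to S240 can be sketched as follows. All names here (`train_until_accurate`, `split`, `train`, `evaluate`, `max_rounds`) are hypothetical. The description loops until the threshold is met; the cap on the number of rounds is an addition of this sketch so that it always terminates.

```python
def train_until_accurate(samples, split, train, evaluate, threshold=0.98, max_rounds=100):
    """Re-divide and retrain until the test accuracy reaches the threshold.

    split(samples, round_no) -> (train_set, test_set)
    train(train_set)         -> current occurrence rate information
    evaluate(rates, test_set)-> accuracy in [0, 1]
    """
    for round_no in range(max_rounds):
        train_set, test_set = split(samples, round_no)
        rates = train(train_set)
        accuracy = evaluate(rates, test_set)
        if accuracy >= threshold:
            # The rates at this point play the role of the
            # "first occurrence rate information".
            return rates, accuracy
    raise RuntimeError("accuracy threshold not reached within max_rounds")


# Stub callables used only to show the control flow.
accuracies = iter([0.90, 0.95, 0.99])
rounds_seen = []


def stub_split(samples, round_no):
    rounds_seen.append(round_no)
    return samples[:2], samples[2:]


first_rates, final_accuracy = train_until_accurate(
    [1, 2, 3, 4],
    split=stub_split,
    train=lambda train_set: {"n_train": len(train_set)},
    evaluate=lambda rates, test_set: next(accuracies),
)
print(first_rates, final_accuracy, rounds_seen)
```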

S250, marking the current occurrence rate information at the end of training as the first occurrence rate information of the missing attributes and non-missing attributes in the data samples.

For example, when training is finished, the current occurrence rate information may be marked, the occurrence rate information obtained in earlier rounds of training may be deleted, or the current occurrence rate information may be stored in a set storage location, so that the current occurrence rate information at the end of training is marked as the first occurrence rate information of the missing attributes and non-missing attributes in the data samples.

S260, acquiring first occurrence rate information of missing attributes and non-missing attributes in data samples, wherein the data samples comprise first data samples containing missing values and second data samples not containing the missing values, and the missing attributes are attributes corresponding to the missing values in the first data samples.

S270, calculating second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information, wherein the second occurrence rate information is the occurrence rate information of each attribute value corresponding to the missing attribute appearing in the first data sample.

S280, determining a filling value corresponding to the first data sample according to the second occurrence rate information, and updating the first data sample according to the filling value.

In the data updating method provided by the second embodiment of the invention, the second data samples are divided into a training sample set and a test sample set according to a set proportion; the training sample set is trained to determine the current occurrence rate information of the missing attributes and non-missing attributes in the data samples; and the obtained current occurrence rate information is tested with the test sample set to generate a test result. If the test result meets the set accuracy threshold, the current occurrence rate information is marked as the first occurrence rate information of the missing attributes and non-missing attributes in the data samples, and the missing value of the first data sample is filled with the filling value determined from the first occurrence rate information; otherwise, the training sample set and the test sample set are divided again until the test result meets the set accuracy threshold. By adopting the above technical scheme, the first occurrence rate information of the missing attributes and non-missing attributes is determined by training and checked against the test sample set, which helps ensure its accuracy, so that the correctness of the filling value and the validity of the data information are further improved, the processing speed of the missing value is improved, and the time required for processing the missing value is reduced.
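Steps S260 to S280 can be sketched as follows. This sketch rests on two assumptions that the text here does not confirm: the second occurrence rate information is taken to be the naive Bayes style score P(y_i) multiplied by the product of P(R_j | y_i) over the non-missing attribute values, and the filling value is taken to be the attribute value with the highest score. The function name `fill_missing` and the data layout are hypothetical; the numbers come from the marital status and gender example given earlier.

```python
def fill_missing(sample, missing_attr, prior, conditional):
    """Return (updated_sample, scores) for one sample with a missing value.

    prior[y]                    -> P(y) for each value y of the missing attribute
    conditional[y][attr][value] -> P(value | y) for each non-missing attribute
    Assumed score: P(y) * product of P(R_j | y). This form is an assumption.
    """
    scores = {}
    for y, p_y in prior.items():
        score = p_y
        for attr, value in sample.items():
            if attr == missing_attr:
                continue
            score *= conditional[y][attr].get(value, 0.0)
        scores[y] = score
    best = max(scores, key=scores.get)  # assumption: pick the highest score
    updated = dict(sample)              # leave the input sample untouched
    updated[missing_attr] = best
    return updated, scores


prior = {"unmarried": 0.4, "married": 0.54, "divorced": 0.06}
conditional = {
    "unmarried": {"gender": {"male": 1100 / 2000, "female": 900 / 2000}},
    "married": {"gender": {"male": 1400 / 2700, "female": 1300 / 2700}},
    "divorced": {"gender": {"male": 0.5, "female": 0.5}},
}
first_sample = {"gender": "female", "marital_status": None}
updated, scores = fill_missing(first_sample, "marital_status", prior, conditional)
print(updated, scores)
```

With these inputs the scores are roughly 0.18 for unmarried, 0.26 for married and 0.03 for divorced, so the sketch fills in "married".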

EXAMPLE III

The third embodiment of the invention provides a data updating device. The device can be implemented in hardware and/or software, can generally be integrated in a data processing platform, and can process data by executing the data updating method. Fig. 3 is a block diagram of the structure of a data updating device according to the third embodiment of the present invention. As shown in fig. 3, the device includes:

a first occurrence rate information obtaining module 310, configured to obtain first occurrence rate information of a missing attribute and a non-missing attribute in data samples, where the data samples include a first data sample including a missing value and a second data sample not including the missing value, and the missing attribute is an attribute corresponding to the missing value in the first data sample;

a second occurrence rate information calculating module 320, configured to calculate second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information, where the second occurrence rate information is occurrence rate information of each attribute value corresponding to the missing attribute appearing in the first data sample;

a data sample updating module 330, configured to determine a padding value corresponding to the first data sample according to the second occurrence rate information, and update the first data sample according to the padding value.

In the apparatus for updating data provided in the third embodiment of the present invention, the first occurrence rate information of the missing attribute and the non-missing attribute in the data sample is obtained by the first occurrence rate information obtaining module, the second occurrence rate information of each attribute value corresponding to the missing attribute appearing in the data sample including the missing value is calculated by the second occurrence rate information calculating module according to the obtained first occurrence rate information, the padding value corresponding to the missing value in the data sample including the missing value is determined by the data sample updating module according to the second occurrence rate information, and the data sample including the missing value is updated according to the padding value. By adopting the above technical solution, the filling value corresponding to the missing value in the data sample containing the missing value is determined according to the occurrence rate information of each attribute value corresponding to the missing attribute in the data sample containing the missing value, so that the correctness of the filling value and the validity of the data information can be improved, the processing speed of the missing value is improved, the time required for processing the missing value is reduced, and the accuracy of the subsequent data processing flow and the average speed of the whole data processing process are improved.
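The cooperation of the three modules can be sketched as a small class. This is an illustrative sketch, not the structure of the actual device: the class name `DataUpdatingApparatus`, its fields and the choice of the highest-scoring attribute value as the filling value are assumptions, and modules 310 and 320 are represented by injected callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class DataUpdatingApparatus:
    # Module 310: returns the first occurrence rate information.
    obtain_first_rates: Callable[[], Any]
    # Module 320: maps (first rates, sample) to {attribute value: second rate}.
    calc_second_rates: Callable[[Any, Dict[str, Any]], Dict[str, float]]
    missing_attr: str

    def update(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        """Module 330: choose a filling value and return the updated sample."""
        first_rates = self.obtain_first_rates()
        second_rates = self.calc_second_rates(first_rates, sample)
        # Assumption: the filling value is the value with the highest rate.
        filling_value = max(second_rates, key=second_rates.get)
        updated = dict(sample)
        updated[self.missing_attr] = filling_value
        return updated


apparatus = DataUpdatingApparatus(
    obtain_first_rates=lambda: {"married": 0.54, "unmarried": 0.4, "divorced": 0.06},
    calc_second_rates=lambda first_rates, sample: first_rates,  # toy stand-in
    missing_attr="marital_status",
)
print(apparatus.update({"gender": "male", "marital_status": None}))
```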

On the basis of the above embodiment, the first occurrence rate information includes first sub-occurrence rate information of each attribute value corresponding to the missing attribute in the second data samples, and second sub-occurrence rate information of each attribute value of the non-missing attributes of the first data sample in the second data samples under the condition of the attribute value corresponding to the missing attribute; the second occurrence rate information calculating module 320 is configured to: calculate, according to the formula [formula], second occurrence rate information of each attribute value corresponding to the missing attribute, where $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ under the condition of the attribute value $y_i$, and the attribute value $R_j$ is an attribute value of a non-missing attribute in the first data sample.

On the basis of the foregoing embodiment, the first occurrence rate information includes first sub-occurrence rate information of each attribute value corresponding to the missing attribute in the second data samples, second sub-occurrence rate information of each attribute value of the non-missing attributes of the first data sample in the second data samples under the condition of the attribute value corresponding to the missing attribute, and weight value information corresponding to the attribute values of the non-missing attributes of the first data sample; the second occurrence rate information calculating module 320 is configured to: calculate, according to the formula [formula], second occurrence rate information of each attribute value corresponding to the missing attribute, where $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $W_j$ is the weight value information corresponding to the attribute value $R_j$, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ under the condition of the attribute value $y_i$, and the attribute value $R_j$ is an attribute value of a non-missing attribute in the first data sample.

On the basis of the foregoing embodiment, the apparatus for updating data provided in this embodiment may further include: a first occurrence rate information determining module, configured to train the second data samples to determine the first occurrence rate information of the missing attributes and non-missing attributes in the data samples before the first occurrence rate information of the missing attributes and non-missing attributes in the data samples is obtained.

On the basis of the above embodiment, the first occurrence rate information determination module includes: the data sample dividing unit is used for dividing the second data sample into a training sample set and a testing sample set according to a set proportion; a current occurrence rate information determining unit, configured to train the training sample set to determine current occurrence rate information of missing attributes and non-missing attributes in data samples; the occurrence rate information testing unit is used for testing the current occurrence rate information by adopting the test sample set to generate a test result; the cyclic calling unit is used for ending the training operation if the test result meets a set accuracy threshold; otherwise, the training sample set and the test sample set are divided again, and the newly divided training sample set is trained until the test result meets the set accuracy threshold; and the first occurrence rate information marking unit is used for marking the current occurrence rate information when the training is finished as the first occurrence rate information of the missing attribute and the non-missing attribute in the data sample.

On the basis of the above embodiment, the non-missing attribute is a related attribute of the missing attribute.

The data updating device provided by the embodiment can execute the data updating method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the data updating method. For technical details that are not described in detail in this embodiment, reference may be made to a method for updating data provided in any embodiment of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

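1. A method of data updating, comprising:

acquiring first occurrence rate information of missing attributes and non-missing attributes in data samples, wherein the data samples comprise first data samples containing missing values and second data samples not containing the missing values, and the missing attributes are attributes corresponding to the missing values in the first data samples;

calculating second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information, wherein the second occurrence rate information is the occurrence rate information of each attribute value corresponding to the missing attribute appearing in the first data sample;

and determining a filling value corresponding to the first data sample according to the second occurrence rate information, and updating the first data sample according to the filling value.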

2. The method according to claim 1, wherein the first occurrence rate information includes first sub-occurrence rate information of each attribute value corresponding to the missing attribute in the second data sample and second sub-occurrence rate information of each attribute value of the non-missing attribute in the first data sample in the second data sample conditioned on the attribute value corresponding to the missing attribute;

the calculating second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information includes:

calculating, according to the formula [formula], second occurrence rate information of each attribute value corresponding to the missing attribute, wherein $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ under the condition of the attribute value $y_i$, and the attribute value $R_j$ is an attribute value of a non-missing attribute in the first data sample.

3. The method according to claim 1, wherein the first occurrence information includes first sub-occurrence information of each attribute value corresponding to the missing attribute in a second data sample, second sub-occurrence information of each attribute value of non-missing attributes in the first data sample in the second data sample conditioned on the attribute value corresponding to the missing attribute, and weight value information corresponding to the attribute value of non-missing attributes of the first data sample;

calculating, according to the formula [formula], second occurrence rate information of each attribute value corresponding to the missing attribute, wherein $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $W_j$ is the weight value information corresponding to the attribute value $R_j$, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ under the condition of the attribute value $y_i$, and the attribute value $R_j$ is an attribute value of a non-missing attribute in the first data sample.

4. The method of claim 1, further comprising, prior to said obtaining first occurrence information for missing attributes and non-missing attributes in data samples:

training the second data sample to determine first occurrence rate information of missing attributes and non-missing attributes in the data sample.

5. The method of claim 4, wherein training the second data sample to determine the first occurrence information of missing attributes and non-missing attributes in the data sample comprises:

dividing the second data sample into a training sample set and a testing sample set according to a set proportion;

training the training sample set to determine current occurrence rate information of missing attributes and non-missing attributes in the data samples;

testing the current occurrence rate information by adopting the test sample set to generate a test result;

if the test result meets the set accuracy threshold, ending the training operation; otherwise, the training sample set and the test sample set are divided again, and the newly divided training sample set is trained until the test result meets the set accuracy threshold;

and marking the current occurrence rate information when the training is finished as first occurrence rate information of the missing attribute and the non-missing attribute in the data sample.

6. The method of any of claims 1-5, wherein the non-missing attribute is a correlation attribute of a missing attribute.

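7. An apparatus for updating data, comprising:

a first occurrence rate information obtaining module, configured to obtain first occurrence rate information of a missing attribute and a non-missing attribute in data samples, where the data samples include a first data sample including a missing value and a second data sample not including the missing value, and the missing attribute is an attribute corresponding to the missing value in the first data sample;

a second occurrence rate information calculating module, configured to calculate second occurrence rate information of each attribute value corresponding to the missing attribute according to the first occurrence rate information, where the second occurrence rate information is occurrence rate information of each attribute value corresponding to the missing attribute appearing in the first data sample;

a data sample updating module, configured to determine a padding value corresponding to the first data sample according to the second occurrence rate information, and update the first data sample according to the padding value.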

8. The apparatus according to claim 7, wherein the first occurrence information includes first sub-occurrence information of each attribute value corresponding to the missing attribute in the second data sample and second sub-occurrence information of each attribute value of the non-missing attribute in the first data sample in the second data sample conditioned on the attribute value corresponding to the missing attribute;

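the second occurrence information calculation module is configured to:

calculate, according to the formula [formula], second occurrence rate information of each attribute value corresponding to the missing attribute, wherein $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ under the condition of the attribute value $y_i$, and the attribute value $R_j$ is an attribute value of a non-missing attribute in the first data sample.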

9. The apparatus according to claim 7, wherein the first occurrence information includes first sub-occurrence information of each attribute value corresponding to the missing attribute in a second data sample, second sub-occurrence information of each attribute value of non-missing attributes in the first data sample in the second data sample conditioned on the attribute value corresponding to the missing attribute, and weight value information corresponding to the attribute value of non-missing attributes of the first data sample;

the second occurrence information calculation module is configured to:

calculate, according to the formula [formula], second occurrence rate information of each attribute value corresponding to the missing attribute, wherein $P(y_i \mid R)$ is the second occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $P(y_i)$ is the first sub-occurrence rate information of the attribute value $y_i$ corresponding to the missing attribute, $W_j$ is the weight value information corresponding to the attribute value $R_j$, $P(R_j \mid y_i)$ is the second sub-occurrence rate information of the attribute value $R_j$ under the condition of the attribute value $y_i$, and the attribute value $R_j$ is an attribute value of a non-missing attribute in the first data sample.

10. The apparatus of claim 7, further comprising:

a first occurrence rate information determining module, configured to train a second data sample to determine first occurrence rate information of missing attributes and non-missing attributes in the data sample before the first occurrence rate information of the missing attributes and the non-missing attributes in the data sample is obtained;

the first occurrence information determination module includes:

the data sample dividing unit is used for dividing the second data sample into a training sample set and a testing sample set according to a set proportion;

a current occurrence rate information determining unit, configured to train the training sample set to determine current occurrence rate information of missing attributes and non-missing attributes in data samples;

the occurrence rate information testing unit is used for testing the current occurrence rate information by adopting the test sample set to generate a test result;

the cyclic calling unit is used for ending the training operation if the test result meets a set accuracy threshold; otherwise, the training sample set and the test sample set are divided again, and the newly divided training sample set is trained until the test result meets the set accuracy threshold;

and the first occurrence rate information marking unit is used for marking the current occurrence rate information when the training is finished as the first occurrence rate information of the missing attribute and the non-missing attribute in the data sample.