CN112685416A - Data missing item filling method and device, computer equipment and storage medium - Google Patents

Data missing item filling method and device, computer equipment and storage medium

Info

Publication number
CN112685416A
Authority
CN
China
Prior art keywords
missing
data
filling
field
items
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011612026.3A
Other languages
Chinese (zh)
Inventor
白王梓松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority to CN202011612026.3A
Publication of CN112685416A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a data missing item filling method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring first data and detecting whether the first data has missing items; if the first data has missing items, calculating the number of missing items of the first data; if the number of missing items belongs to a first preset range [1, N], filling by adopting a random filling strategy based on naive Bayes, wherein N is a positive integer; if the number of missing items belongs to a second preset range [N+1, M], filling by adopting a weighted mean filling strategy based on KNN, wherein M is a positive integer greater than N+1; and if the number of missing items belongs to a third preset range [M+1, +∞), determining a filling method according to a preset rule. With the method, the device, the computer equipment and the storage medium, missing data items can be filled accurately.

Description

Data missing item filling method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for filling data missing items, a computer device, and a storage medium.
Background
When data visualization daily reports are produced and analyzed, some field dimensions are often obviously abnormal or empty. Obviously abnormal values are deleted during processing, after which all null data values are collectively referred to as missing data values.
In visual data analysis, a data population containing missing data cannot be used directly. When indexes such as the mean, sum and variance are calculated over data with missing items, the missing items cause calculation exceptions or obvious errors. The common practice in data processing is simply to remove every piece of data that has a missing item, so the data population suffers a certain loss, the resulting visual daily report cannot reflect the sales performance accurately, and, if the number of missing items is large, the plotted image cannot truly reflect the data population. Although data with missing items is abnormal data, the missing items usually occupy only a small part of the data dimensions, and deleting the whole piece of data greatly weakens the integrity of the first data.
Disclosure of Invention
The application mainly aims to provide a data missing item filling method, a data missing item filling device, computer equipment and a storage medium, and aims to solve the technical problem of inaccurate missing item filling.
In order to achieve the above object, the present application provides a data missing item filling method, including the following steps:
acquiring first data and detecting whether the first data has missing items;
if the first data has missing items, calculating the number of missing items of the first data;
if the number of missing items belongs to a first preset range [1, N], filling by adopting a random filling strategy based on naive Bayes; wherein N is a positive integer;
if the number of missing items belongs to a second preset range [N+1, M], filling by adopting a weighted mean filling strategy based on KNN; wherein M is a positive integer greater than N+1;
and if the number of missing items belongs to a third preset range [M+1, +∞), determining a filling method according to a preset rule.
Further, the step of determining the filling method according to a preset rule if the number of missing items belongs to the third preset range [M+1, +∞) includes:
acquiring a mark field of the first data;
detecting whether the mark field has missing items;
if the mark field has missing items, calculating a first missing item number in the mark field;
if the first missing item number belongs to the first preset range [1, N], filling the missing items of the mark field by adopting the naive Bayes-based random filling strategy; and if the first missing item number belongs to the second preset range [N+1, M], filling the missing items of the mark field by adopting the KNN-based weighted mean filling strategy.
Further, after the step of filling the missing items of the mark field by adopting the naive Bayes-based random filling strategy if the first missing item number belongs to the first preset range [1, N], and by adopting the KNN-based weighted mean filling strategy if the first missing item number belongs to the second preset range [N+1, M], the method includes:
detecting whether the mark field has a derived field;
if the derived field exists, detecting whether the derived field has missing items;
if the derived field has missing items, calculating a second missing item number in the derived field;
if the second missing item number belongs to the first preset range [1, N], filling the missing items of the derived field by adopting the naive Bayes-based random filling strategy; and if the second missing item number belongs to the second preset range [N+1, M], filling the missing items of the derived field by adopting the KNN-based weighted mean filling strategy.
Further, the step of determining the filling method according to a preset rule if the number of missing items belongs to the third preset range [M+1, +∞) includes:
classifying the fields in the first data according to a preset classification algorithm to obtain a plurality of field groups;
calculating a third missing item number of each field group;
if the third missing item number belongs to the first preset range [1, N], filling the missing items of the first data by adopting the naive Bayes-based random filling strategy; and if the third missing item number belongs to the second preset range [N+1, M], filling the missing items of the first data by adopting the KNN-based weighted mean filling strategy.
Further, after the step of calculating the number of missing items of the first data, the method includes:
acquiring, in the first data, a first field that has a time characteristic and is missing;
determining, according to the first field, a second field that has a periodic relationship with the first field;
detecting whether the second field is missing;
if the second field is not missing, acquiring a period interval between the first field and the second field;
determining a data value of the first field according to the period interval and the second field.
Further, the step of filling by adopting the naive Bayes-based random filling strategy includes:
acquiring a plurality of second data; wherein the fields of the second data have no missing items;
deleting the missing-item field from the first data, and grouping the resulting first data and the plurality of second data by a naive Bayes method to obtain a plurality of data groups;
in the data group in which the first data is located, adopting the formula
$\hat{y} = y_{\max i} + \varepsilon$
to calculate a target data value corresponding to the missing-item field, and filling the target data value into the first data; wherein $y_{\max i}$ is the data value of the missing-item field of the second data in the data group in which the first data is located, and $\varepsilon$ is a normally distributed random number with mean 0 and standard deviation $\sigma_0$.
The present application further provides a data missing item filling device, including:
a first acquisition unit, used for acquiring first data and detecting whether the first data has missing items;
a calculating unit, used for calculating the number of missing items of the first data if the first data has missing items;
a first filling unit, used for filling by adopting a random filling strategy based on naive Bayes if the number of missing items belongs to a first preset range [1, N]; wherein N is a positive integer;
a second filling unit, used for filling by adopting a weighted mean filling strategy based on KNN if the number of missing items belongs to a second preset range [N+1, M]; wherein M is a positive integer greater than N+1;
and a third filling unit, used for determining a filling method according to a preset rule if the number of missing items belongs to a third preset range [M+1, +∞).
Further, the third filling unit includes:
an obtaining subunit, configured to obtain a mark field of the first data;
a first detecting subunit, configured to detect whether the mark field has missing items;
a first calculating subunit, configured to calculate a first missing item number in the mark field if the mark field has missing items;
a first filling subunit, configured to fill the missing items of the mark field by adopting the naive Bayes-based random filling strategy if the first missing item number belongs to the first preset range [1, N], and by adopting the KNN-based weighted mean filling strategy if the first missing item number belongs to the second preset range [N+1, M].
The application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the data missing item filling method when executing the computer program.
The present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the data missing item filling method according to any one of the above.
According to the data missing item filling method and device, the computer equipment and the storage medium, different filling strategies are selected according to the number of missing items, so that the advantages of each filling strategy are reasonably used, missing items can be filled more accurately, and the first data can be filled at a lower time cost and space cost.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a data missing item filling method according to an embodiment of the present application;
FIG. 2 is a block diagram of a data missing item filling apparatus according to an embodiment of the present application;
FIG. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to FIG. 1, an embodiment of the present application provides a data missing item filling method, including the following steps:
step S1, acquiring first data, and detecting whether the first data has missing items;
step S2, if the first data has missing items, calculating the number of missing items of the first data;
step S3, if the number of missing items belongs to a first preset range [1, N], filling by adopting a random filling strategy based on naive Bayes; wherein N is a positive integer;
step S4, if the number of missing items belongs to a second preset range [N+1, M], filling by adopting a weighted mean filling strategy based on KNN; wherein M is a positive integer greater than N+1;
and step S5, if the number of missing items belongs to a third preset range [M+1, +∞), determining a filling method according to a preset rule.
In this embodiment, as described in steps S1-S2 above, each piece of first data corresponds to a plurality of fields, and each field corresponds to a data value. When the data value corresponding to a field is absent, the first data has a missing item, and the number of fields whose data values are absent is the number of missing items.
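The counting and range test of steps S1 to S5 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function names, the use of `None` to mark an absent data value, the field names, and the thresholds N = 2 and M = 5 are all assumptions made for the example.

```python
from typing import Any, Mapping, Optional


def count_missing(record: Mapping[str, Optional[Any]]) -> int:
    """Number of fields whose data value is absent (None is the assumed marker)."""
    return sum(1 for value in record.values() if value is None)


def choose_strategy(missing: int, n: int, m: int) -> str:
    """Map a missing-item count to the strategy of step S3, S4 or S5."""
    if n < 1 or m <= n + 1:
        raise ValueError("need N >= 1 and M > N + 1")
    if missing == 0:
        return "none"                    # nothing to fill
    if missing <= n:
        return "naive_bayes_random"      # first preset range [1, N]
    if missing <= m:
        return "knn_weighted_mean"       # second preset range [N+1, M]
    return "preset_rule"                 # third preset range [M+1, +inf)


# Hypothetical sales record with two absent data values.
record = {"sale_time": None, "amount": 1200.0, "region": None, "channel": "app"}
print(choose_strategy(count_missing(record), n=2, m=5))  # naive_bayes_random
```

Keeping the range test separate from the filling routines means the same selector can be reused wherever the text applies the ranges [1, N] and [N+1, M] to a subset of fields.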
When the number of missing items belongs to the first preset range [1, N], the first data is mildly incomplete, and the naive Bayes-based random filling strategy is used for filling. The strategy can be represented by the following formula:
$\hat{y} = y_{\max i} + \varepsilon$
where $y_{\max i}$ is the item value of the category with the largest $P(y_i \mid x)$ in naive Bayes classification. To match the dispersion trend of the entities with missing items, a noise term $\varepsilon$ is added, where $\varepsilon$ is a normally distributed random number with $\mu = 0$ and $\sigma = \sigma_0$, and $\sigma_0$ is the standard deviation of the item corresponding to the missing item in the normal first data population. Random filling based on naive Bayes classification requires classifiers customized to the missing items of the first data: if only 1 item is missing, the classifier depends on which item is missing, and 8 classifiers need to be trained; when 2 items are missing, 28 classifiers are required, and the time cost and space cost increase accordingly.
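The sketch below illustrates two points from this paragraph under stated assumptions. First, the counts 8 and 28 are consistent with choosing 1 or 2 missing fields out of 8 candidate fields; the figure of 8 fields is inferred from those counts and is not stated in the text. Second, the filling value is the item value of the most probable class plus zero-mean Gaussian noise. The posterior probabilities are assumed to come from a naive Bayes classifier trained elsewhere, and `posteriors`, `class_values` and `random_fill` are hypothetical names.

```python
import math
import random
import statistics
from typing import Mapping, Optional, Sequence


def classifier_count(n_fields: int, n_missing: int) -> int:
    """Number of distinct missing-field patterns, assuming one classifier each."""
    return math.comb(n_fields, n_missing)


def random_fill(posteriors: Mapping[str, float],
                class_values: Mapping[str, float],
                complete_values: Sequence[float],
                rng: Optional[random.Random] = None) -> float:
    """Return y_max_i + eps, where eps is drawn from N(0, sigma0^2)."""
    rng = rng or random.Random()
    best = max(posteriors, key=posteriors.get)    # class with the largest P(y_i | x)
    sigma0 = statistics.pstdev(complete_values)   # std of the field in complete data
    return class_values[best] + rng.gauss(0.0, sigma0)


print(classifier_count(8, 1), classifier_count(8, 2))  # 8 28

# Hypothetical posteriors and per-class item values for one missing field.
value = random_fill({"low": 0.2, "high": 0.8},
                    {"low": 10.0, "high": 20.0},
                    complete_values=[9.0, 11.0, 19.0, 21.0],
                    rng=random.Random(0))
```

The population standard deviation (`statistics.pstdev`) is used here because the text refers to the data population; a sample standard deviation would be an equally plausible reading.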
When the number of missing items belongs to the second preset range [N+1, M], the first data is moderately incomplete, and the KNN-based weighted mean filling strategy is used for filling. The strategy can be represented by the following formula:
Figure BDA0002874947910000052
where $x_i$ is a weighted value, $d_i$ is a distance, and $K$ is a constant that can be determined by cross-validation. Specifically, the KNN-based weighted mean filling strategy uses the fields of the first data that have no missing items to find, among the complete data, the K pieces of data closest to the first data, calculates the distances between these K pieces of data and the first data, computes the weighted mean of the missing-item fields over the K pieces of data, and takes this weighted mean as the filling value. The distance may be the Euclidean distance, calculated as follows:
$d(a, b) = \sqrt{\sum_{j=1}^{n} (a_j - b_j)^2}$
where $a$ and $b$ are two pieces of data compared over their $n$ non-missing fields. The data are normalized using the Z-Score method, with the following formula:
$z = \frac{v - \mu}{\sigma}$
where $v$ is a raw field value, and $\mu$ and $\sigma$ are the mean and standard deviation of that field. Using the normalized data, for a given pattern of missing items, the Euclidean distance between the remaining items of each piece of normal data and those of the first data is calculated, and the results are sorted. The weighted mean of the missing items in the K pieces of data closest to the current data is then calculated and used to fill the missing items. With KNN-based weighted mean filling, the time and space cost is almost the same whether the number of missing items is large or small, so the method has an advantage when many items are missing and is at a cost disadvantage when few items are missing.
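The procedure described above (Z-Score normalization, Euclidean distance over the non-missing fields, sorting, and a weighted mean over the K nearest complete pieces of data) can be sketched as follows. Because the weighting formula is not legible in this text, inverse-distance weights are used here purely as an assumption; the function name, the single missing field, and the sample rows are also hypothetical.

```python
import math
import statistics
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def knn_weighted_mean(target: Row, complete: Sequence[Sequence[float]],
                      missing_idx: int, k: int) -> float:
    """Fill target[missing_idx] with a distance-weighted mean over k neighbours."""
    known = [j for j in range(len(target)) if j != missing_idx]
    # Z-Score statistics per known field, taken from the complete data.
    mu = {j: statistics.fmean([row[j] for row in complete]) for j in known}
    sd = {j: statistics.pstdev([row[j] for row in complete]) or 1.0 for j in known}

    def z(row: Row) -> List[float]:
        return [(row[j] - mu[j]) / sd[j] for j in known]

    zt = z(target)
    # Euclidean distance on the normalized known fields, sorted ascending.
    scored = sorted((math.dist(zt, z(row)), row[missing_idx]) for row in complete)
    nearest = scored[:k]
    # Assumed weighting: inverse distance, with a small constant to avoid 1/0.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


complete_rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [9.0, 90.0, 900.0],
]
# The third field is missing; its two nearest neighbours are the rows ending in 200 and 300.
filled = knn_weighted_mean([2.5, 25.0, None], complete_rows, missing_idx=2, k=2)
```

The cost profile noted in the text is visible here: the distance computation scans all complete data regardless of how many fields are missing, so the work per piece of data barely changes with the number of missing items.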
In this embodiment, the filling strategy is determined directly by the number of missing items. Without discarding a large amount of first data, the combined filling strategy offsets the weaknesses of the individual filling algorithms, makes the most of their respective strengths, and minimizes the influence of first data with missing items on the whole. In addition, the standardized first data preprocessing flow can further improve the efficiency of first data analysis, reduce the labor cost that enterprises spend on first data analysis, improve the accuracy of performance analysis, and provide guidance for performance prediction and correction.
In an embodiment, the step of determining the filling method according to a preset rule if the number of missing items belongs to the third preset range [M+1, +∞) includes:
step S51, acquiring a mark field of the first data;
step S52, detecting whether the mark field has missing items;
step S53, if the mark field has missing items, calculating a first missing item number in the mark field;
step S54, if the first missing item number belongs to the first preset range [1, N], filling the missing items of the mark field by adopting the naive Bayes-based random filling strategy; and if the first missing item number belongs to the second preset range [N+1, M], filling the missing items of the mark field by adopting the KNN-based weighted mean filling strategy.
In this embodiment, sales first data contains many derived fields, which are often calculated or derived from other fields. In this case, the fields from which other fields are readily derived may be marked. The first missing item number in the mark fields is calculated first; when it is within the first preset range [1, N], the naive Bayes-based random filling strategy is used for filling, and when it is within the second preset range [N+1, M], the KNN-based weighted mean filling strategy is used for filling.
In an embodiment, after the step S54 of filling the missing items of the mark field by adopting the naive Bayes-based random filling strategy if the first missing item number belongs to the first preset range [1, N], and by adopting the KNN-based weighted mean filling strategy if the first missing item number belongs to the second preset range [N+1, M], the method includes:
step S55, detecting whether the mark field has a derived field;
step S56, if the derived field exists, detecting whether the derived field has missing items;
step S57, if the derived field has missing items, calculating a second missing item number in the derived field;
step S58, if the second missing item number belongs to the first preset range [1, N], filling the missing items of the derived field by adopting the naive Bayes-based random filling strategy; and if the second missing item number belongs to the second preset range [N+1, M], filling the missing items of the derived field by adopting the KNN-based weighted mean filling strategy.
In this embodiment, after the mark field is filled, the derived fields corresponding to the mark field are checked for missing items; if a derived field has missing items, the second missing item number in the derived field is calculated and the corresponding filling method is determined from it. By distinguishing derived fields from mark fields and processing them in batches, this embodiment avoids the increase in time and space cost caused by an excessive total number of missing items.
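The two-batch order of steps S51 to S58 can be sketched as a planning function that counts missing items in the mark fields first and in their derived fields second, choosing a strategy for each batch. This is only an illustrative sketch: the field names, the `derived` mapping, the thresholds, and the return shape are assumptions, and the actual filling routines are left out.

```python
from typing import Any, List, Mapping, Optional, Sequence, Tuple

Plan = List[Tuple[str, List[str], Optional[str]]]


def pick(missing: int, n: int, m: int) -> Optional[str]:
    """Strategy for a batch, or None when the count is 0 or still above M."""
    if 1 <= missing <= n:
        return "naive_bayes_random"
    if n < missing <= m:
        return "knn_weighted_mean"
    return None


def plan_batches(record: Mapping[str, Optional[Any]],
                 marked: Sequence[str],
                 derived: Mapping[str, Sequence[str]],
                 n: int, m: int) -> Plan:
    """Plan the mark-field batch first, then the derived-field batch."""
    missing_marked = [f for f in marked if record.get(f) is None]
    derived_fields = [d for f in marked for d in derived.get(f, [])]
    missing_derived = [f for f in derived_fields if record.get(f) is None]
    return [
        ("marked", missing_marked, pick(len(missing_marked), n, m)),
        ("derived", missing_derived, pick(len(missing_derived), n, m)),
    ]


# Hypothetical record: 4 missing items in total, above M = 3, yet 2 per batch.
record = {"price": None, "quantity": 3, "discount": None, "total": None, "net": None}
plan = plan_batches(record,
                    marked=["price", "quantity", "discount"],
                    derived={"price": ["total"], "discount": ["net"]},
                    n=1, m=3)
```

In this example the record as a whole falls in the third preset range, while each batch taken alone falls in the second, which is the effect the embodiment describes.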
In an embodiment, the step of determining the filling method according to a preset rule if the number of missing items belongs to the third preset range [M+1, +∞) includes:
step S5a, grouping the fields in the first data according to a preset classification algorithm to obtain a plurality of field groups;
step S5b, calculating a third missing item number of each field group;
step S5c, if the third missing item number belongs to the first preset range [1, N], filling the missing items of the first data by adopting the naive Bayes-based random filling strategy; and if the third missing item number belongs to the second preset range [N+1, M], filling the missing items of the first data by adopting the KNN-based weighted mean filling strategy.
In this embodiment, a preset classification algorithm may be used to classify the fields; specifically, a classification model may be trained on the specific first data and used to classify the fields of the first data into a plurality of field groups. The third missing item number of each field group is then calculated, and the corresponding filling strategy is determined from it and applied. Because the fields are grouped and the filling strategy is determined per group, a large number of missing items is divided among several groups, so the third missing item number of each group is smaller. This minimizes the likelihood that the first data has to be discarded entirely and preserves the integrity of the first data to the greatest extent.
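Steps S5a to S5c can be sketched as a per-group count followed by a per-group strategy choice. The sketch assumes the field groups are already available as a mapping; in the text they come from a preset classification algorithm, which is not reproduced here. Group names, field names and thresholds are hypothetical.

```python
from typing import Any, Dict, Mapping, Optional, Sequence, Tuple


def pick_strategy(missing: int, n: int, m: int) -> Optional[str]:
    """Strategy for one field group, or None when nothing applies."""
    if 1 <= missing <= n:
        return "naive_bayes_random"
    if n < missing <= m:
        return "knn_weighted_mean"
    return None


def strategies_by_group(record: Mapping[str, Optional[Any]],
                        groups: Mapping[str, Sequence[str]],
                        n: int, m: int) -> Dict[str, Tuple[int, Optional[str]]]:
    """Third missing item number and chosen strategy for each field group."""
    result: Dict[str, Tuple[int, Optional[str]]] = {}
    for name, fields in groups.items():
        missing = sum(1 for f in fields if record.get(f) is None)
        result[name] = (missing, pick_strategy(missing, n, m))
    return result


# Hypothetical record with 6 missing items overall (above M = 3).
record = {
    "sale_time": None, "deposit_time": "2020-01-03",
    "principal": None, "rate": None, "fee": None,
    "age": None, "city": None,
}
groups = {
    "time": ["sale_time", "deposit_time"],
    "amount": ["principal", "rate", "fee"],
    "customer": ["age", "city"],
}
per_group = strategies_by_group(record, groups, n=1, m=3)
```

With these assumed values the six missing items split into groups of one, three and two, so every group can be filled by one of the two strategies instead of the whole piece of data being handled by the fallback rule.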
In an embodiment, after the step S2 of calculating the number of missing items of the first data, the method includes:
step S2A, acquiring, in the first data, a first field that has a time characteristic and is missing;
step S2B, determining, according to the first field, a second field that has a periodic relationship with the first field;
step S2C, detecting whether the second field is missing;
step S2D, if the second field is not missing, acquiring a period interval between the first field and the second field;
step S2E, determining a data value of the first field according to the period interval and the second field.
In this embodiment, as described in step S2A, the first data includes many fields with time characteristics, such as the sale time and the deposit time in sales first data. A first field that has a time characteristic and is missing is acquired; for example, if the sale time is missing in a piece of sales first data, the sale time is taken as the first field.
As described in step S2B, in the normal flow the time interval between the sale time and the deposit time of different first data fluctuates within a small range, that is, the sale time and the deposit time have a periodic relationship. There are often several second fields that have a periodic relationship with the first field, and the time interval between the first field and each second field is different.
As described in steps S2B-S2E, the second field may itself be missing, in which case the time point of the first field cannot be obtained from the second field. Otherwise, the time interval is obtained by averaging the intervals between a plurality of real first fields and second fields, and the data value of the first field can be determined from the second field and the corresponding time interval.
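The interval-based filling of steps S2A to S2E can be sketched as follows, under the assumptions that the first field is a sale time, the second field is a deposit time that follows it, and the period interval is the plain average of the observed intervals. The function names and sample dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def mean_interval(pairs: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (second - first) over data where both fields are present."""
    total = sum((second - first for first, second in pairs), timedelta())
    return total / len(pairs)


def fill_first_field(second: Optional[datetime],
                     interval: timedelta) -> Optional[datetime]:
    """Derive the missing first field from the second field, if that is present."""
    if second is None:
        return None          # the second field is missing too, so nothing is derived
    return second - interval


# Observed (sale time, deposit time) pairs from complete data.
pairs = [
    (datetime(2020, 1, 1, 10), datetime(2020, 1, 3, 10)),   # 2 days apart
    (datetime(2020, 1, 5, 10), datetime(2020, 1, 9, 10)),   # 4 days apart
]
interval = mean_interval(pairs)                              # 3 days
print(fill_first_field(datetime(2020, 2, 10, 10), interval))  # 2020-02-07 10:00:00
```

If several second fields are available, the same two functions could be applied to whichever one is present, each with its own averaged interval.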
In this embodiment, for loan sales first data, filling a time point field directly with the above algorithms works poorly, because fields with time characteristics are highly random and customized. Within a statistical period of the same caliber, however, such fields often show statistical regularity. In actual operation it is therefore necessary to calculate a new period field from the time point field according to the business statistical period, apply the algorithm to the period field rather than to the time point field, and after filling derive the time point field back from the period field and fill it in.
In an embodiment, the step S3 of filling by adopting the naive Bayes-based random filling strategy includes:
step S31, acquiring a plurality of second data; wherein the fields of the second data have no missing items;
step S32, deleting the missing-item field from the first data, and grouping the resulting first data and the plurality of second data by a naive Bayes method to obtain a plurality of data groups;
step S33, in the data group in which the first data is located, adopting the formula
$\hat{y} = y_{\max i} + \varepsilon$
to calculate a target data value corresponding to the missing-item field, and filling the target data value into the first data; wherein $y_{\max i}$ is the data value of the missing-item field of the second data in the data group in which the first data is located, and $\varepsilon$ is a normally distributed random number with mean 0 and standard deviation $\sigma_0$.
In this embodiment, the second data is complete data, that is, the second data has a complete data value in the same field as the missing-item field of the first data. The missing-item field is deleted from the first data to obtain new first data, and the new first data and the second data are then grouped by a naive Bayes method to obtain a plurality of data groups, the new first data belonging to one of them. Within that data group, the formula
$\hat{y} = y_{\max i} + \varepsilon$
is used to calculate the target data value. Specifically, if the missing-item field in the first data is field A, the formula is applied to the data values of field A of the second data in the data group to obtain the target data value, which is then filled into the first data.
Referring to FIG. 2, an embodiment of the present application provides a data missing item filling apparatus, including:
a first obtaining unit 10, configured to obtain first data and detect whether the first data has missing items;
a calculating unit 20, configured to calculate the number of missing items of the first data if the first data has missing items;
a first padding unit 30, configured to, if the number of missing entries belongs to a first preset range [1, N ], pad with a naive bayes-based random padding policy; wherein N is a positive integer;
a second padding unit 40, configured to, if the number of missing entries belongs to a second preset range [ N +1, M ], pad the missing entries by using a weighted mean padding policy based on KNN; wherein M is greater than the N +1, and M is a positive integer;
and the third filling unit 50 is configured to determine a filling method according to a preset rule if the number of missing entries belongs to a third preset range [ M +1, + ∞ ].
In one embodiment, the third filling unit 50 includes:
a first obtaining subunit, configured to obtain a mark field of the first data;
a first detecting subunit, configured to detect whether the mark field has missing items;
a first calculating subunit, configured to calculate a first missing item number in the mark field if the mark field has missing items;
a first padding subunit, configured to, if the first number of missing entries belongs to a first preset range [1, N ], pad the missing entries of the mark field by using a naive bayes-based random padding policy; and if the first missing item number belongs to a second preset range [ N +1, M ], filling the missing items of the mark field by adopting a weighted mean filling strategy based on KNN.
In one embodiment, the third filling unit 50 further includes:
a second detecting subunit, configured to detect whether the mark field has a derived field;
a third detecting subunit, configured to detect whether the derived field has missing items if the derived field exists;
the second calculation subunit is configured to calculate a second number of missing entries in the derived field if the derived field has missing entries;
a second padding subunit, configured to, if the second number of missing entries belongs to a first preset range [1, N ], pad the missing entries of the derived field by using a naive bayes-based random padding policy; and if the second missing item number belongs to a second preset range [ N +1, M ], filling the missing items of the derived field by adopting a weighted mean filling strategy based on KNN.
In one embodiment, the third filling unit 50 includes:
the grouping subunit is used for grouping the fields in the first data according to a preset classification algorithm to obtain a plurality of field groups;
the third calculating subunit is used for calculating the third missing item number of each field group;
a third filling subunit, configured to fill the missing item of the first data by using a naive bayes-based random filling policy if the third missing item number belongs to a first preset range [1, N ]; and if the third missing item number belongs to a second preset range [ N +1, M ], filling the missing items of the first data by adopting a weighted mean filling strategy based on KNN.
In an embodiment, the data missing item filling apparatus further includes:
a second obtaining unit, configured to obtain, in the first data, a first field that has a time characteristic and is missing;
a first determining unit, configured to determine, according to the first field, a second field that has a periodic relationship with the first field;
a detecting unit, configured to detect whether the second field is missing;
a third obtaining unit, configured to obtain a period interval between the first field and the second field if the second field is not missing;
a second determining unit, configured to determine a data value of the first field according to the periodic interval and the second field.
In one embodiment, the first filling unit 30 includes:
a second obtaining subunit, configured to obtain a plurality of second data; wherein the fields of the second data have no missing items;
a deleting subunit, configured to delete the missing-item field from the first data, and group the resulting first data and the plurality of second data by a naive Bayes method to obtain a plurality of data groups;
a fourth calculating subunit, configured to adopt, in the data group in which the first data is located, the formula
$\hat{y} = y_{\max i} + \varepsilon$
to calculate a target data value corresponding to the missing-item field, and fill the target data value into the first data; wherein $y_{\max i}$ is the data value of the missing-item field of the second data in the data group in which the first data is located, and $\varepsilon$ is a normally distributed random number with mean 0 and standard deviation $\sigma_0$.
In this embodiment, please refer to the above method embodiment for the specific implementation of each unit and sub-unit, which is not described herein again.
Referring to FIG. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a first database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a storage medium and an internal memory. The storage medium stores an operating system, a computer program, and the first database. The internal memory provides an environment for running the operating system and the computer program in the storage medium. The first database of the computer device stores the first data, the second data and the like. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements the data missing item filling method.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method for filling data missing items.
In summary, the data missing item filling method, apparatus, computer device and storage medium provided in the embodiments of the present application acquire first data and detect whether the first data has missing items; if the first data has missing items, calculate the number of missing items of the first data; if the number of missing items falls within a first preset range [1, N], fill by a naive Bayes-based random filling strategy, wherein N is a positive integer; if the number of missing items falls within a second preset range [N+1, M], fill by a KNN-based weighted mean filling strategy, wherein M is a positive integer greater than N+1; and if the number of missing items falls within a third preset range [M+1, +∞), determine a filling method according to a preset rule. Because the filling strategy is chosen according to the number of missing items, with different numbers of missing items corresponding to different strategies, the advantages of each strategy are used where they apply: the first data can be filled more accurately, and at lower time and space cost.
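The range-based selection summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: missing items are represented as `None`, the function names are hypothetical, and the inverse-distance weighting over Euclidean distance in `knn_weighted_mean` is one common choice of KNN weighting, assumed here because this passage does not specify the weights.

```python
import math


def count_missing(record):
    """Number of fields whose value is missing (represented here as None)."""
    return sum(1 for v in record.values() if v is None)


def choose_strategy(n_missing, n, m):
    """Map a missing-item count to a strategy name using the three ranges
    [1, N], [N+1, M] and [M+1, +inf)."""
    if n < 1 or m <= n + 1:
        raise ValueError("require N >= 1 and M > N + 1")
    if n_missing == 0:
        return "no_fill_needed"
    if n_missing <= n:
        return "naive_bayes_random"
    if n_missing <= m:
        return "knn_weighted_mean"
    return "preset_rule"


def knn_weighted_mean(record, complete, field, k=3):
    """Estimate `field` as the inverse-distance weighted mean of its value
    in the k nearest complete records (weighting scheme is an assumption)."""
    shared = [f for f, v in record.items() if v is not None and f != field]

    def dist(row):
        # Euclidean distance over the fields observed in `record`.
        return math.sqrt(sum((record[f] - row[f]) ** 2 for f in shared))

    neighbours = sorted(complete, key=dist)[:k]
    # The small constant avoids division by zero for identical records.
    weights = [1.0 / (dist(row) + 1e-9) for row in neighbours]
    total = sum(w * row[field] for w, row in zip(weights, neighbours))
    return total / sum(weights)


# Hypothetical numeric records for illustration.
complete_rows = [
    {"a": 0.0, "b": 10.0},
    {"a": 2.0, "b": 20.0},
    {"a": 100.0, "b": 1000.0},
]
first_data = {"a": 1.0, "b": None}
strategy = choose_strategy(count_missing(first_data), n=2, m=5)
estimate = knn_weighted_mean(first_data, complete_rows, "b", k=2)
```

Keeping the range check separate from the filling functions means N and M can be tuned without touching either strategy.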
Those skilled in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a first database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.
The above description covers only preferred embodiments of the present application and is not intended to limit its scope. All equivalent structures or equivalent process modifications made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of the present application.

Claims (10)

1. A data missing item filling method, characterized by comprising the following steps:
acquiring first data and detecting whether the first data has missing items;
if the first data has missing items, calculating the number of missing items of the first data;
if the number of missing items falls within a first preset range [1, N], filling by a naive Bayes-based random filling strategy, wherein N is a positive integer;
if the number of missing items falls within a second preset range [N+1, M], filling by a KNN-based weighted mean filling strategy, wherein M is a positive integer greater than N+1;
and if the number of missing items falls within a third preset range [M+1, +∞), determining a filling method according to a preset rule.
2. The data missing item filling method according to claim 1, wherein the step of determining a filling method according to a preset rule if the number of missing items falls within the third preset range [M+1, +∞) comprises:
acquiring a mark field of the first data;
detecting whether the mark field has missing items;
if the mark field has missing items, calculating a first number of missing items in the mark field;
if the first number of missing items falls within the first preset range [1, N], filling the missing items of the mark field by the naive Bayes-based random filling strategy; and if the first number of missing items falls within the second preset range [N+1, M], filling the missing items of the mark field by the KNN-based weighted mean filling strategy.
3. The data missing item filling method according to claim 2, wherein after the step of filling the missing items of the mark field by the naive Bayes-based random filling strategy if the first number of missing items falls within the first preset range [1, N], and filling the missing items of the mark field by the KNN-based weighted mean filling strategy if the first number of missing items falls within the second preset range [N+1, M], the method comprises:
detecting whether the mark field has a derived field;
if the derived field exists, detecting whether the derived field has missing items;
if the derived field has missing items, calculating a second number of missing items in the derived field;
if the second number of missing items falls within the first preset range [1, N], filling the missing items of the derived field by the naive Bayes-based random filling strategy; and if the second number of missing items falls within the second preset range [N+1, M], filling the missing items of the derived field by the KNN-based weighted mean filling strategy.
4. The data missing item filling method according to claim 1, wherein the step of determining a filling method according to a preset rule if the number of missing items falls within the third preset range [M+1, +∞) comprises:
classifying the fields in the first data according to a preset classification algorithm to obtain a plurality of field groups;
calculating a third number of missing items for each field group;
if the third number of missing items falls within the first preset range [1, N], filling the missing items of the first data by the naive Bayes-based random filling strategy; and if the third number of missing items falls within the second preset range [N+1, M], filling the missing items of the first data by the KNN-based weighted mean filling strategy.
5. The data missing item filling method according to claim 1, wherein after the step of calculating the number of missing items of the first data, the method comprises:
acquiring, in the first data, a first field that has a missing item and a time characteristic;
determining, according to the first field, a second field that has a periodic relationship with the first field;
detecting whether the second field has a missing item;
if the second field has no missing item, acquiring a period interval between the first field and the second field;
determining a data value of the first field from the period interval and the second field.
6. The data missing item filling method according to claim 1, wherein the step of filling by a naive Bayes-based random filling strategy comprises:
acquiring a plurality of second data, wherein no field of the second data has a missing item;
deleting the field of the missing item from the first data, and grouping the first data after deletion together with the plurality of second data by a naive Bayes method to obtain a plurality of data groups;
within the data group in which the first data is located, calculating a target data value corresponding to the field of the missing item by using the formula
(formula given as an image: Figure FDA0002874947900000031)
and filling the target data value into the first data, wherein ymaxi is the data value of the field of the missing item in the second data within the data group in which the first data is located.
7. A data missing item filling apparatus, comprising:
a first acquisition unit, configured to acquire first data and detect whether the first data has missing items;
a calculating unit, configured to calculate the number of missing items of the first data if the first data has missing items;
a first filling unit, configured to fill by a naive Bayes-based random filling strategy if the number of missing items falls within a first preset range [1, N], wherein N is a positive integer;
a second filling unit, configured to fill by a KNN-based weighted mean filling strategy if the number of missing items falls within a second preset range [N+1, M], wherein M is a positive integer greater than N+1;
and a third filling unit, configured to determine a filling method according to a preset rule if the number of missing items falls within a third preset range [M+1, +∞).
8. The data missing item filling apparatus according to claim 7, wherein the third filling unit comprises:
an obtaining subunit, configured to obtain a mark field of the first data;
a first detecting subunit, configured to detect whether the mark field has missing items;
a first calculating subunit, configured to calculate a first number of missing items in the mark field if the mark field has missing items;
a first filling subunit, configured to fill the missing items of the mark field by the naive Bayes-based random filling strategy if the first number of missing items falls within the first preset range [1, N], and to fill the missing items of the mark field by the KNN-based weighted mean filling strategy if the first number of missing items falls within the second preset range [N+1, M].
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the data missing item filling method according to any one of claims 1 to 6.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the data missing item filling method according to any one of claims 1 to 6.
CN202011612026.3A 2020-12-30 2020-12-30 Data missing item filling method and device, computer equipment and storage medium Pending CN112685416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011612026.3A CN112685416A (en) 2020-12-30 2020-12-30 Data missing item filling method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011612026.3A CN112685416A (en) 2020-12-30 2020-12-30 Data missing item filling method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112685416A true CN112685416A (en) 2021-04-20

Family

ID=75455315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011612026.3A Pending CN112685416A (en) 2020-12-30 2020-12-30 Data missing item filling method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112685416A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742296A (en) * 2021-09-09 2021-12-03 诺优信息技术(上海)有限公司 Method and device for slicing drive test data and electronic equipment
CN113742296B (en) * 2021-09-09 2024-04-30 诺优信息技术(上海)有限公司 Drive test data slicing processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210420

WD01 Invention patent application deemed withdrawn after publication