CN114443635B - Data cleaning method and device in soil big data analysis - Google Patents


Info

Publication number
CN114443635B
CN114443635B (application CN202210067946.4A)
Authority
CN
China
Prior art keywords
data
scattered
soil
sphere
dispersion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210067946.4A
Other languages
Chinese (zh)
Other versions
CN114443635A (en)
Inventor
石媛媛
邓明军
唐健
赵隽宇
覃祚玉
宋贤冲
王会利
潘波
覃其云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Zhuang Autonomous Region Forestry Research Institute
Original Assignee
Guangxi Zhuang Autonomous Region Forestry Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Zhuang Autonomous Region Forestry Research Institute filed Critical Guangxi Zhuang Autonomous Region Forestry Research Institute
Priority to CN202210067946.4A priority Critical patent/CN114443635B/en
Publication of CN114443635A publication Critical patent/CN114443635A/en
Application granted granted Critical
Publication of CN114443635B publication Critical patent/CN114443635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00: Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/24: Earth materials
    • G01N33/246: Earth materials for water content

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Food Science & Technology (AREA)
  • General Life Sciences & Earth Sciences (AREA)
  • Geology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Environmental & Geological Engineering (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Processing Of Solid Wastes (AREA)

Abstract

The invention relates to the field of data analysis, and in particular to a data cleaning method and device in soil big data analysis. The method comprises the following steps: collecting soil data, and acquiring environmental data at the time of collection; dispersing the collected soil data by category to obtain a plurality of dispersed data sets; constructing a dispersed data sphere from the data structure and data size of each dispersed data set; and finally constructing a data cleaning cube and combining it with the dispersed data spheres to obtain the final cleaned data. Unlike the prior art, which only searches the data itself for abnormal values, the method marks normal data by constructing a data cube so that abnormal data can be corrected, and builds a correction model around anomalies caused by environmental conditions during soil data collection, thereby remarkably improving the accuracy of data cleaning.

Description

Data cleaning method and device in soil big data analysis
Technical Field
The invention belongs to the field of data analysis, and particularly relates to a data cleaning method and device in soil big data analysis.
Background
Data cleansing refers to the process of re-examining and verifying data, with the aim of deleting duplicate information, correcting existing errors, and ensuring data consistency.
Data cleansing is, as the name suggests, the "washing out" of "dirty" data: the final procedure for finding and correcting identifiable errors in a data file, including checking data consistency and handling invalid and missing values. Because the data in a data warehouse is a subject-oriented collection extracted from multiple business systems and containing historical data, it is unavoidable that some of the data is erroneous or that some records conflict with one another. Such erroneous or conflicting records are obviously unwanted and are called "dirty data". Dirty data must be "washed out" according to certain rules, and this is data cleansing. The task of data cleansing is to filter out data that does not meet requirements and hand the filtered results to the responsible business department, which confirms whether the data should be discarded or corrected before extraction. Data that does not meet requirements mainly comprises incomplete data, erroneous data, and duplicate data. Data cleansing differs from questionnaire auditing in that cleansing after entry is generally done by computer rather than by hand.
Patent document CN201510947469.0 discloses a decision-tree-based method for analyzing attitude and orbit control data. The method preprocesses the attitude and orbit control data, completing telemetry data de-duplication, sorting, extraction, and outlier rejection; performs hierarchical modeling of the attitude and orbit control system, establishing its information and control flow chart, determining the telemetry variables related to the current fault, and taking them as input variables for decision tree analysis; establishes a flow chart of the decision tree analysis; and creates a decision tree model using the C5.0 algorithm, defining the model name, the number of boosting tests, the pruning attributes, and the minimum record count of each sub-branch.
That patent describes a related scheme for cleaning data, but its cleaning approach still relies on conventional prior art: abnormal values remain in the cleaned data, which reduces the accuracy of subsequent data analysis.
Disclosure of Invention
The main aim of the invention is to provide a data cleaning method and device in soil big data analysis. Unlike the prior art, which only searches the data itself for abnormal values, the method marks normal data by constructing a data cube so that abnormal data can be corrected, and builds a correction model around anomalies caused by environmental conditions during soil data collection, thereby remarkably improving the accuracy of data cleaning.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a method for data cleansing in soil big data analysis, the method comprising the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data are collected; the collected soil data at least comprises: soil effective water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to the categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly classifying collected soil data according to data types to obtain a plurality of classified data, and amplifying each classified data according to a set proportion to obtain scattered data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: carrying out data analysis on each piece of scattered data to obtain data characteristics of all pieces of scattered data, and respectively constructing a data cleaning cube of all pieces of scattered data by taking the data characteristics of each piece of scattered data as a center and taking the data radius of the scattered data as a side length;
step 5: placing the scattered data spheres in the data cleaning cube, turning the scattered data spheres in the data cleaning cube, wherein in the turning process, the scattered data on the surface of the scattered data spheres are contacted with the data cleaning cube, and each contacted scattered data is recorded;
step 6: and reserving recorded scattered data, carrying out data denoising on the scattered data which are not recorded, combining the scattered data after denoising with the recorded scattered data to obtain combined scattered data, and correcting the combined scattered data by using a preset correction model to obtain the final cleaned soil data.
Further, the ratio set in step 2 ranges from 3 to 8, the value depending on the type of the classified data: when the classified data is the soil effective water content, the set ratio is 3; when it is the sand content, the set ratio is 4; when it is the silt content, the set ratio is 5; when it is the clay content, the set ratio is 6; when it is the soil volume weight, the set ratio is 7; and when it is the organic carbon content, the set ratio is 8.
Further, the method for constructing the dispersion data sphere in the step 3 specifically includes: calculating the data size of the scattered data, taking the calculated data size of the scattered data as the radius of the scattered data sphere, and constructing a model by using a preset data sphere to construct a scattered data sphere so that the scattered data is uniformly distributed on the outer surface of the scattered data sphere.
Further, the data sphere construction model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): C is the dispersed data; min(C) is the minimum of the dispersed data; max(C) is the maximum of the dispersed data; O_x is the calculated x-axis coordinate of the sphere center; O_y is the calculated y-axis coordinate of the sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. With the sphere center calculated by the construction model and the calculated data size of the dispersed data as the radius, the dispersed data sphere is constructed.
Further, the method in step 4 for performing data analysis on each piece of dispersed data to obtain the data features of all the dispersed data comprises: normalizing and averaging the dispersed data to generate first feature dispersion data, where n groups of normalization layers are used for normalization and n-1 layers for averaging; performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, and performing averaging processing on the first feature dispersion data to generate third feature dispersion data; splicing the third feature dispersion data with the second feature dispersion data to generate fourth feature dispersion data; and taking the fourth feature dispersion data as the data feature of the dispersed data.
Further, the third feature dispersion data has the same length as the second feature dispersion data.
Further, in the step of performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, the first group comprises three expansion normalization passes.
Further, in the step 6, the method for correcting the combined dispersed data by using a preset correction model to obtain the final cleaned soil data includes: based on the obtained ambient temperature, ambient humidity and ambient illumination intensity, a preset correction value model is used for calculating to obtain a correction value, and the correction value is multiplied by each piece of data in the combined scattered data to obtain final cleaned soil data.
Further, the correction value model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
Data cleaning device in soil big data analysis.
Compared with the prior art, which only searches the data itself for abnormal values, the data cleaning method and device in soil big data analysis mark normal data by constructing a data cube so that abnormal data can be corrected, and build a correction model around anomalies caused by environmental conditions during soil data collection, thereby remarkably improving the accuracy of data cleaning. This is mainly realized as follows. Amplification of data: the invention first distinguishes normal data from abnormal data through data amplification; after amplification the abnormal data differ markedly from the normal data, and this distinction improves the efficiency of subsequent cleaning. Construction of a data cube: the constructed data cube supports a more three-dimensional analysis of the data and can directly lock onto abnormal data; the dispersed data are fitted against the cleaning cube, data that make contact with it are normal, and data that do not are abnormal. This way of finding abnormal data is more accurate, although less efficient than the prior art. Data correction: the invention corrects soil data using the environmental temperature and humidity at collection time, improving the validity and accuracy of the cleaned data obtained after correction.
Drawings
Fig. 1 is a schematic system structure diagram of a data cleaning method in soil big data analysis according to an embodiment of the present invention;
fig. 2 is a schematic diagram of data dispersion of a data cleaning method and a data cleaning device in soil big data analysis according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a principle of turning a dispersed data sphere inside a data cleaning cube in the data cleaning method and device in soil big data analysis according to the embodiment of the invention.
Detailed Description
The method of the invention is described in further detail below in connection with the accompanying drawings and the embodiments of the invention.
Example 1
As shown in fig. 1, a data cleaning method in soil big data analysis performs the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data are collected; the collected soil data at least comprises: soil effective water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to the categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly classifying collected soil data according to data types to obtain a plurality of classified data, and amplifying each classified data according to a set proportion to obtain scattered data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: carrying out data analysis on each piece of scattered data to obtain data characteristics of all pieces of scattered data, and respectively constructing a data cleaning cube of all pieces of scattered data by taking the data characteristics of each piece of scattered data as a center and taking the data radius of the scattered data as a side length;
step 5: placing the scattered data spheres in the data cleaning cube, turning the scattered data spheres in the data cleaning cube, wherein in the turning process, the scattered data on the surface of the scattered data spheres are contacted with the data cleaning cube, and each contacted scattered data is recorded;
step 6: and reserving recorded scattered data, carrying out data denoising on the scattered data which are not recorded, combining the scattered data after denoising with the recorded scattered data to obtain combined scattered data, and correcting the combined scattered data by using a preset correction model to obtain the final cleaned soil data.
Referring to fig. 2, fig. 2 is a schematic diagram of data dispersion according to the present invention. After data dispersion, the original data are amplified; the amplified data follow the same underlying pattern as the original data, but abnormal data become easier to find because their difference from normal data is magnified, which improves the accuracy of data cleaning.
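The dispersion step can be sketched in a few lines; the category names, the `disperse` helper, and the dictionary layout are illustrative assumptions, while the per-category ratios follow the values stated in Example 2:

```python
# Per-category amplification ratios as stated in the patent text
# (category key names are assumptions for illustration).
AMPLIFICATION_RATIOS = {
    "effective_water": 3, "sand": 4, "silt": 5,
    "clay": 6, "bulk_density": 7, "organic_carbon": 8,
}

def disperse(records):
    """Group raw (category, value) pairs and scale each group by its ratio."""
    dispersed = {}
    for category, value in records:
        dispersed.setdefault(category, []).append(
            value * AMPLIFICATION_RATIOS[category])
    return dispersed

out = disperse([("sand", 0.31), ("sand", 0.29), ("clay", 0.12)])
```

Multiplying each class by a different fixed ratio preserves the relative pattern within a class while widening the gap between normal values and outliers, which is the effect fig. 2 illustrates.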
Referring to fig. 3, after the dispersed data sphere enters the data cleaning cube, the dispersed data distributed on its surface come into contact with the cube. Data on the sphere that make contact with the cube are normal data that can be cleaned, while data that make no contact are abnormal data. Compared with the prior art, which detects abnormal values directly on the data, this approach can uncover hidden abnormal data and improves the accuracy of data cleaning.
Example 2
On the basis of the above embodiment, the ratio set in step 2 ranges from 3 to 8, the value depending on the type of the classified data: when the classified data is the soil effective water content, the set ratio is 3; when it is the sand content, the set ratio is 4; when it is the silt content, the set ratio is 5; when it is the clay content, the set ratio is 6; when it is the soil volume weight, the set ratio is 7; and when it is the organic carbon content, the set ratio is 8.
Example 3
On the basis of the above embodiment, the method for constructing the dispersion data sphere in the step 3 specifically includes: calculating the data size of the scattered data, taking the calculated data size of the scattered data as the radius of the scattered data sphere, and constructing a model by using a preset data sphere to construct a scattered data sphere so that the scattered data is uniformly distributed on the outer surface of the scattered data sphere.
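Since the patent's sphere-construction formula is not reproduced in the source text, the following sketch rests on plausible assumptions: the center is the midpoint of the data range on the x and y axes with z fixed at 0 (as the text describes), the element count stands in for the "data size" used as the radius, and a Fibonacci-sphere layout stands in for the required uniform distribution of the data over the sphere surface:

```python
import math

def build_data_sphere(values):
    """Return (center, radius, points) with one surface point per datum.

    Center and radius choices are assumptions; the formula itself is
    omitted from the source text.
    """
    c_min, c_max = min(values), max(values)
    center = ((c_min + c_max) / 2.0, (c_min + c_max) / 2.0, 0.0)
    radius = float(len(values))  # "data size" taken as the element count
    n = len(values)
    golden = math.pi * (3.0 - math.sqrt(5.0))  # Fibonacci-sphere spacing
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * i / max(n - 1, 1)          # evenly spaced heights
        r = math.sqrt(max(0.0, 1.0 - z * z))       # ring radius at height z
        theta = golden * i                          # golden-angle rotation
        points.append((center[0] + radius * r * math.cos(theta),
                       center[1] + radius * r * math.sin(theta),
                       center[2] + radius * z))
    return center, radius, points
```

Every generated point lies exactly on the sphere surface, matching the requirement that the dispersed data be distributed on the outer surface of the dispersed data sphere.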
Specifically, a consistency check examines whether the data are reasonable given the plausible value range and interrelationships of each variable, and finds data that are out of range, logically unreasonable, or contradictory. For example, a variable measured on a 1-7 scale that takes the value 0, or a negative body weight, should be regarded as out of range. Software such as SPSS, SAS, and Excel can automatically identify out-of-range values according to a defined range. Logically inconsistent answers may appear in several forms: many respondents say they drive to work yet report owning no car, or a respondent reports being a heavy purchaser and user of a brand while giving a very low score on the familiarity scale. When inconsistencies are found, the questionnaire serial number, record serial number, variable name, and error category are listed to facilitate further verification and correction.
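A minimal range check in the spirit of the consistency check described above; the variable names and plausible ranges are assumptions for illustration:

```python
# Assumed plausible ranges for two illustrative variables.
PLAUSIBLE_RANGES = {
    "scale_1_to_7": (1, 7),
    "weight_kg": (0.0, 700.0),
}

def consistency_check(variable, values):
    """Return the values that fall outside the variable's plausible range."""
    low, high = PLAUSIBLE_RANGES[variable]
    return [v for v in values if not (low <= v <= high)]
```

In practice the flagged values would be listed together with record identifiers for further verification, as the text describes.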
Example 4
On the basis of the above embodiment, the data sphere construction model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): C is the dispersed data; min(C) is the minimum of the dispersed data; max(C) is the maximum of the dispersed data; O_x is the calculated x-axis coordinate of the sphere center; O_y is the calculated y-axis coordinate of the sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. With the sphere center calculated by the construction model and the calculated data size of the dispersed data as the radius, the dispersed data sphere is constructed.
Specifically, owing to survey, coding, and entry errors, the data may contain invalid and missing values that need to be handled appropriately. The usual treatments are estimation, casewise deletion, variable deletion, and pairwise deletion.
Estimation. The simplest approach is to replace invalid and missing values with the sample mean, median, or mode of the variable. This is easy but does not make full use of the information already in the data, and the error may be large. Another approach is to estimate the answer from the respondent's answers to other questions, using correlation analysis or logical inference between variables. For example, ownership of a product may be related to household income, so the likelihood of ownership can be inferred from the respondent's household income.
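Mean substitution, the simplest estimation approach mentioned above, can be sketched as follows (missing entries are represented as `None`):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

As the text notes, this keeps the sample size intact but ignores between-variable information, so the imputation error may be large.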
Casewise deletion discards samples containing missing values. Since many questionnaires may have missing values, this can significantly reduce the effective sample size and fail to make full use of the data already collected. It is therefore only suitable when a critical variable is missing, or when samples containing invalid or missing values make up a small proportion.
Variable deletion. If a variable has many invalid and missing values and is not particularly important to the problem under study, it may be deleted. This reduces the number of variables available for analysis but does not change the sample size.
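Casewise deletion and variable deletion can be sketched together; the 50% missing-share threshold for dropping a variable is an illustrative assumption, not a value from the text:

```python
def casewise_delete(rows):
    """Drop every row (case) that contains a missing value."""
    return [r for r in rows if None not in r]

def variable_delete(rows, max_missing_share=0.5):
    """Drop columns (variables) whose missing share exceeds the threshold."""
    n = len(rows)
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing_share]
    return [[r[j] for j in keep] for r in rows]

data = [[1, None, 3], [4, None, 6], [7, 8, 9]]
```

Note the trade-off the text describes: casewise deletion shrinks the sample, while variable deletion keeps every sample but loses a variable.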
Example 5
On the basis of the above embodiment, the method in step 4 for performing data analysis on each piece of dispersed data to obtain the data features of all the dispersed data comprises: normalizing and averaging the dispersed data to generate first feature dispersion data, where n groups of normalization layers are used for normalization and n-1 layers for averaging; performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, and performing averaging processing on the first feature dispersion data to generate third feature dispersion data; splicing the third feature dispersion data with the second feature dispersion data to generate fourth feature dispersion data; and taking the fourth feature dispersion data as the data feature of the dispersed data.
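A loose sketch of the feature pipeline described above, assuming min-max normalization and a single simplified "expansion" pass; the actual layer counts and the expansion normalization operation are not specified precisely in the text, so both are stand-ins:

```python
def normalize(values):
    """Min-max normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def feature_vector(values):
    first = normalize(values)                        # first feature data
    second = normalize([v * 2 for v in first])       # stand-in "expansion" pass
    third = [sum(first) / len(first)] * len(second)  # averaged, same length
    return third + second                            # spliced fourth feature data
```

The splice keeps the third and second feature data the same length, matching the constraint in the next embodiment.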
Example 6
On the basis of the above embodiment, the third feature dispersion data has the same length as the second feature dispersion data.
Abnormal data is generally divided into two categories:
1) Avoidable dirty data
As the name suggests, avoidable dirty data can be processed directly into valid data or avoided through manual correction.
Such dirty data is quite common in practice: errors caused by naming irregularities, spelling mistakes, input errors, null values, and so on.
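Avoidable dirty data of this kind can often be fixed mechanically; the normalization rules below (trim, lowercase, underscores, drop empties) are illustrative assumptions:

```python
def clean_labels(labels):
    """Normalize irregular names and drop empty entries."""
    cleaned = []
    for raw in labels:
        label = raw.strip().lower().replace(" ", "_")
        if label:  # drop entries that were empty or whitespace-only
            cleaned.append(label)
    return cleaned
```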
2) Unavoidable dirty data
Unavoidable dirty data mainly takes the form of outliers, repeated values, null values, and the like, and must be handled during cleaning.
A common detection means for abnormal values is the 3σ rule: assuming a group of measurements contains only random error, the standard deviation is calculated and an interval determined at a given probability; errors beyond that interval are considered gross errors rather than random errors, and the data containing them should be removed. In general the interval is the mean plus or minus three standard deviations, hence the name 3σ rule.
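The 3σ rule described above can be implemented directly (here using the population standard deviation):

```python
import math

def three_sigma_filter(values):
    """Keep values within mean ± 3 standard deviations; drop gross errors."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) <= 3 * std]
```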
Example 7
On the basis of the above embodiment, in the step of performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, the first group comprises three expansion normalization passes.
Example 8
On the basis of the above embodiment, the method for correcting the combined dispersion data to obtain the final cleaned soil data in step 6 by using a preset correction model includes: based on the obtained ambient temperature, ambient humidity and ambient illumination intensity, a preset correction value model is used for calculating to obtain a correction value, and the correction value is multiplied by each piece of data in the combined scattered data to obtain final cleaned soil data.
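Since the correction value model itself is not reproduced in the source, this sketch treats it as a pluggable function of the three environmental readings and shows only the multiplicative application described above; the no-op default `model` is a placeholder assumption:

```python
def correction_value(temp, humidity, light, model=lambda t, h, l: 1.0):
    """Compute the correction value from environmental readings.

    `model` stands in for the patent's correction formula, which the
    source text does not reproduce; the default is a no-op placeholder.
    """
    return model(temp, humidity, light)

def apply_correction(values, correction):
    """Multiply every datum in the combined dispersed data by the correction."""
    return [v * correction for v in values]
```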
Specifically, data cleansing is the process of re-examining and verifying data, aimed at deleting duplicate information, correcting existing errors, and ensuring data consistency.
Example 9
On the basis of the above embodiment, the correction value model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
Specifically, because soil data are easily affected by environmental parameters during collection, they need to be corrected against the environmental data; the corrected data markedly improve accuracy.
Example 10
Data cleaning device in soil big data analysis.
Although several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of the present disclosure.
The present examples are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein. For example, various elements or components may be combined or combined in another system, or certain features may be omitted or not implemented.
Furthermore, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present invention. Other items shown or discussed as coupled or directly coupled or communicating with each other may also be indirectly coupled or communicating through some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art without departing from the spirit and scope disclosed herein.

Claims (8)

1. A method for data cleaning in soil big data analysis, characterized in that the method performs the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data are collected; the collected soil data at least comprises: soil effective water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to the categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly classifying collected soil data according to data types to obtain a plurality of classified data, and amplifying each classified data according to a set proportion to obtain scattered data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: carrying out data analysis on each piece of scattered data to obtain data characteristics of all pieces of scattered data, and respectively constructing a data cleaning cube of all pieces of scattered data by taking the data characteristics of each piece of scattered data as a center and taking the data radius of the scattered data as a side length;
step 5: placing the scattered data spheres in the data cleaning cube, turning the scattered data spheres in the data cleaning cube, wherein in the turning process, the scattered data on the surface of the scattered data spheres are contacted with the data cleaning cube, and each contacted scattered data is recorded;
step 6: reserving recorded scattered data, carrying out data denoising on the scattered data which are not recorded, combining the scattered data after denoising with the recorded scattered data to obtain combined scattered data, and correcting the combined scattered data by using a preset correction model to obtain final cleaned soil data;
the method for constructing the dispersed data sphere in the step 3 specifically comprises the following steps: calculating the data size of the scattered data, taking the calculated data size of the scattered data as the radius of the scattered data sphere, and constructing a model by using a preset data sphere to construct a scattered data sphere so that the scattered data is uniformly distributed on the outer surface of the scattered data sphere;
the number isThe sphere-based build model is expressed using the following formula:the method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>For dispersing data +.>Is the minimum value of the scattered data; />Is the maximum value of the scattered data; />For calculating the centre of sphereAn axis coordinate; />For the calculated centre of sphere +.>An axis coordinate; ball center +.>The unified value of the axis coordinates is 0; and (3) constructing a dispersed data sphere by taking the calculated data size of the dispersed data as the radius of the dispersed data sphere through the sphere center calculated by the sphere construction model.
2. The method according to claim 1, wherein the proportion set in step 2 takes values in the range 3-8, the value depending on the type of the classified data: when the classified data is the effective water content of the soil, the set proportion is 3; when the classified data is the sand content, the set proportion is 4; when the classified data is the sludge content, the set proportion is 5; when the classified data is the clay content, the set proportion is 6; when the classified data is the soil volume weight, the set proportion is 7; and when the classified data is the organic carbon content, the set proportion is 8.
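Claim 2's type-to-proportion mapping is a plain lookup table; a minimal rendering (the English key names are this sketch's own):

```python
# Set-proportion lookup from claim 2: the value depends on the type of
# classified soil data (key names are English renderings of the claim terms).
SET_PROPORTION = {
    "effective_water_content": 3,
    "sand_content": 4,
    "sludge_content": 5,
    "clay_content": 6,
    "soil_volume_weight": 7,
    "organic_carbon_content": 8,
}

def set_proportion(data_type):
    """Return the claim-2 set proportion for a classified-data type."""
    return SET_PROPORTION[data_type]
```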
3. The method of claim 1, wherein in step 4, performing data analysis on each piece of scattered data to obtain the data characteristics of all pieces of scattered data comprises: normalizing and averaging the scattered data to generate first feature dispersion data, wherein n groups of normalization layers are used for the normalization and n-1 layers are used for the averaging; performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, and performing averaging processing on the first feature dispersion data to generate third feature dispersion data; splicing the third feature dispersion data and the second feature dispersion data to generate fourth feature dispersion data; and taking the fourth feature dispersion data as the data characteristics of the scattered data.
4. The method of claim 3, wherein the third feature dispersion data is the same length as the second feature dispersion data.
5. The method of claim 4 wherein in the step of performing a first set of dilation normalization processes on the first feature dispersion data to generate second feature dispersion data, the number of dilation normalization processes of the first set of dilation normalization processes is three.
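A minimal sketch of the claim 3-5 feature pipeline, under loose assumptions not stated in the claims: "normalization" is taken as min-max scaling, "averaging" as pairwise averaging of neighbours, and each of the three "expansion normalization" passes as inserting midpoints followed by renormalization; the third feature is padded by repetition so that, as claim 4 requires, it matches the second feature's length.

```python
def normalize(xs):
    """Min-max scale to [0, 1] (one 'normalization layer', assumed)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def pairwise_average(xs):
    """Average neighbouring values ('averaging', assumed)."""
    return [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]

def dilate(xs):
    """Insert midpoints between neighbours (one 'expansion' pass, assumed)."""
    out = []
    for a, b in zip(xs, xs[1:]):
        out += [a, (a + b) / 2.0]
    return out + xs[-1:]

def data_features(xs, dilations=3):
    """Claims 3-5: build the first through fourth feature dispersion data."""
    first = pairwise_average(normalize(xs))     # first feature dispersion data
    second = first
    for _ in range(dilations):                  # three expansion-normalization
        second = normalize(dilate(second))      # passes (claim 5)
    third = pairwise_average(first)             # third feature dispersion data,
    third += third[-1:] * (len(second) - len(third))  # padded to match (claim 4)
    return third + second                       # fourth feature: the splice
```

For a five-value input the first feature has length 4, three midpoint dilations grow the second feature to 25, and the padded third feature matches it, so the spliced fourth feature has length 50.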
6. The method of claim 5, wherein the step 6 of correcting the combined dispersion data using a predetermined correction model to obtain the final cleaned soil data comprises: based on the obtained ambient temperature, ambient humidity and ambient illumination intensity, a preset correction value model is used for calculating to obtain a correction value, and the correction value is multiplied by each piece of data in the combined scattered data to obtain final cleaned soil data.
7. The method of claim 6, wherein the correction value model is expressed using a formula (rendered as an image in the source and not reproduced here) whose variables are the ambient temperature, the ambient humidity, and the ambient light intensity.
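Since the correction-value formula itself is not reproduced, the sketch below uses a purely hypothetical multiplicative model (all coefficients and reference conditions are invented for illustration) to show the claim 6 mechanics: compute one correction value from temperature, humidity and illumination, then multiply it into every value of the combined scattered data.

```python
def correction_value(temp_c, humidity, light_lux):
    """Hypothetical stand-in for the patent's correction-value model (the
    actual formula is not reproduced in the source). Coefficients and the
    25 degC / 50% RH / 10 klx reference point are assumptions of this sketch."""
    return (1.0
            + 0.001 * (temp_c - 25.0)         # temperature term (assumed)
            - 0.0005 * (humidity - 50.0)      # humidity term (assumed)
            + 1e-6 * (light_lux - 10000.0))   # illumination term (assumed)

def correct(combined, temp_c, humidity, light_lux):
    """Claim 6: multiply every value of the combined scattered data by the
    correction value to obtain the final cleaned soil data."""
    k = correction_value(temp_c, humidity, light_lux)
    return [x * k for x in combined]
```

At the assumed reference conditions the correction value is exactly 1.0, leaving the combined data unchanged.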
8. Data cleaning apparatus for use in soil big data analysis for performing the method of any one of claims 1 to 7.
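The sphere-turning step (step 5 of claim 1) can be sketched geometrically, under assumptions the claim does not fix: the cube circumscribes the sphere (half-width equal to the radius), "turning" is rotation about the z-axis in whole-degree steps, and a point is "in contact" when one of its coordinates reaches a cube face.

```python
import math

def rotate_z(p, angle):
    """Rotate a 3-D point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)

def turn_and_record(points, radius, steps=360, tol=1e-9):
    """Turn the sphere step by step; record the index of every point that
    touches a face of the enclosing cube (half-width = radius)."""
    recorded = set()
    for k in range(steps):
        angle = 2.0 * math.pi * k / steps
        for i, p in enumerate(points):
            x, y, z = rotate_z(p, angle)
            if max(abs(x), abs(y), abs(z)) >= radius - tol:
                recorded.add(i)
    return recorded

points = [(1.0, 0.0, 0.0),                # equatorial: sweeps past the faces
          (0.0, 0.0, 1.0),                # polar: always touches the top face
          (0.5, 0.5, math.sqrt(0.5))]     # mid-latitude: never reaches a face
kept = turn_and_record(points, radius=1.0)  # -> {0, 1}
```

Points that never touch a face (index 2 here) are the ones step 6 would send to denoising; the contacted points are retained as-is.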
CN202210067946.4A 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis Active CN114443635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210067946.4A CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210067946.4A CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Publications (2)

Publication Number Publication Date
CN114443635A CN114443635A (en) 2022-05-06
CN114443635B true CN114443635B (en) 2024-04-09

Family

ID=81368576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210067946.4A Active CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Country Status (1)

Country Link
CN (1) CN114443635B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004053659A2 (en) * 2002-12-10 2004-06-24 Stone Investments, Inc Method and system for analyzing data and creating predictive models
CN107741990A (en) * 2017-11-01 2018-02-27 深圳汇生通科技股份有限公司 Data cleansing integration method and system
CN109739850A (en) * 2019-01-11 2019-05-10 安徽爱吉泰克科技有限公司 A kind of archives big data intellectual analysis cleaning digging system
CN112733417A (en) * 2020-11-16 2021-04-30 南京邮电大学 Abnormal load data detection and correction method and system based on model optimization
WO2021139249A1 (en) * 2020-05-28 2021-07-15 平安科技(深圳)有限公司 Data anomaly detection method, apparatus and device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Application of data cleaning in medical big data analysis; Mao Yunpeng; Long Hu; Deng Ren; Guo Xin; China Digital Medicine; 2017-06-15 (06); full text *
Research and application of marketing data cleaning and governance methods; Liang Weining; Zhou Yushu; Tang Wenbin; Liu Sen; Chen Lingna; Information Technology and Informatization; 2020-07-28 (07); full text *

Also Published As

Publication number Publication date
CN114443635A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
Lee et al. Intelliclean: a knowledge-based intelligent data cleaner
Low et al. A knowledge-based approach for duplicate elimination in data cleaning
CN109935336B (en) Intelligent auxiliary diagnosis system for respiratory diseases of children
Maletic et al. Data cleansing
Chien et al. A system for online detection and classification of wafer bin map defect patterns for manufacturing intelligence
CN111222458A (en) Rolling bearing fault diagnosis method based on ensemble empirical mode decomposition and convolutional neural network
CN107741990B (en) Data cleaning integration method and system
Wang et al. Defect pattern recognition on wafers using convolutional neural networks
CN110389950B (en) Rapid running big data cleaning method
CN108038211A (en) A kind of unsupervised relation data method for detecting abnormality based on context
CN112597238A (en) Method, system, device and medium for establishing knowledge graph based on personnel information
Shi et al. An improved agglomerative hierarchical clustering anomaly detection method for scientific data
CN114443635B (en) Data cleaning method and device in soil big data analysis
Natarajan et al. Data mining techniques for data cleaning
Huang et al. Importance of data quality in virtual metrology
CN116862109A (en) Regional carbon emission situation awareness early warning method
CN116028847A (en) Universal method and system for automatic intelligent diagnosis of turbine mechanical faults
Pahwa et al. An efficient algorithm for data cleaning
CN115756919A (en) Root cause positioning method and system for multidimensional data
CN114398942A (en) Personal income tax abnormity detection method and device based on integration
Shilpika et al. Toward an in-depth analysis of multifidelity high performance computing systems
Bashir et al. Matlab-based graphical user interface for IOT sensor measurements subject to outlier
CN112559499A (en) Data mining system and method
Wang et al. Decision tree classification algorithm for non-equilibrium data set based on random forests
Zhao An empirical study of data mining in performance evaluation of HRM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant