CN114443635A - Data cleaning method and device in soil big data analysis - Google Patents
Data cleaning method and device in soil big data analysis
- Publication number
- CN114443635A (application CN202210067946.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- dispersed
- soil
- scattered
- sphere
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/24—Earth materials
- G01N33/246—Earth materials for water content
Abstract
The invention relates to the field of data analysis, and in particular to a data cleaning method and device for soil big data analysis. The method comprises the following steps: collecting soil data, and acquiring environmental data at the time the soil data is collected; dispersing the collected soil data by category to obtain a plurality of dispersed data sets; constructing a dispersed data sphere based on the data structure and data size of each dispersed data set; and finally constructing a data cleaning cube and integrating it with the dispersed data spheres to obtain the final cleaned data. Unlike the prior art, which cleans data only by searching for abnormal values, the method and device mark normal data by constructing a data cube, further correct the abnormal data, and build a correction model that accounts for anomalies in the soil data caused by environmental conditions, thereby significantly improving the accuracy of data cleaning.
Description
Technical Field
The invention belongs to the field of data analysis, and particularly relates to a data cleaning method and device in soil big data analysis.
Background
Data cleansing (Data cleansing) refers to a process of re-examining and verifying Data with the purpose of deleting duplicate information, correcting existing errors, and providing Data consistency.
As its name suggests, data cleansing "washes out" the dirt: it is the last procedure for finding and correcting recognizable errors in a data file, including checking data consistency and handling invalid and missing values. Because the data in a data warehouse is a subject-oriented collection extracted from multiple business systems and containing historical data, it is unavoidable that some records are erroneous and some conflict with one another. Such erroneous or conflicting records are clearly unwanted and are called "dirty data". Dirty data must be "washed out" according to defined rules, and this washing is data cleansing. The task of data cleansing is to filter out the data that do not meet requirements, hand the filtered records to the competent business department, and let that department confirm whether they should be discarded or corrected before re-extraction. Unsatisfactory data fall into three major categories: incomplete data, erroneous data, and duplicate data. Data cleansing differs from questionnaire review: cleansing after entry is generally completed by computer rather than by hand.
Patent document CN201510947469.0 discloses a decision-tree-based method for analyzing attitude and orbit control data. The method includes preprocessing of the attitude and orbit control data, through which telemetry data deduplication, sorting, extraction, and outlier elimination are completed; hierarchical modeling of the attitude and orbit control system, establishing its information and control flow charts and determining the telemetry variables related to the system's current fault as input variables for decision-tree analysis; establishing a flow chart for the decision-tree analysis; and building a decision tree C5.0 algorithm model, in which the model name, the number of Boosting tests, the pruning attributes, and the minimum number of records per sub-branch are defined.
That patent describes a related data cleaning scheme, but the cleaning itself still relies on existing conventional techniques: the cleaned data may still contain abnormal values, which reduces the accuracy of subsequent data analysis.
Disclosure of Invention
The main object of the present invention is to provide a data cleaning method and device for soil big data analysis. Unlike the prior art, which cleans data only by searching for abnormal values, the method and device mark normal data by constructing a data cube, further correct the abnormal data, and build a correction model that accounts for anomalies in the soil data caused by environmental conditions, thereby significantly improving the accuracy of data cleaning.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
a method of data cleansing in soil big data analysis, the method performing the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a scattered data sphere based on the data structure and the data size of each scattered data;
step 4: performing data analysis on each dispersed data to obtain data characteristics of all dispersed data, and respectively constructing data cleaning cubes of all dispersed data by taking the data characteristics of each dispersed data as a center and taking the data radius of the dispersed data as a side length;
step 5: placing the scattered data spheres inside the data cleaning cube, and turning the scattered data spheres inside the data cleaning cube, wherein in the turning process, the scattered data on the surfaces of the scattered data spheres are in contact with the data cleaning cube, and each contacted scattered data is recorded;
step 6: reserving the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, combining the noise-reduced dispersed data with the recorded dispersed data to obtain combined dispersed data, and correcting the combined dispersed data by using a preset correction model to obtain the finally cleaned soil data.
Further, the ratio set in step 2 ranges from 3 to 8, and its value depends on the category of the classified data: when the classified data is the soil available water content, the set ratio is 3; for the sand content, 4; for the silt content, 5; for the clay content, 6; for the soil volume weight, 7; and for the organic carbon content, 8.
Further, the method for constructing the scattered data sphere in step 3 specifically includes: calculating the data volume of the scattered data, taking the calculated data volume of the scattered data as the radius of a scattered data sphere, and constructing a scattered data sphere by using a preset data sphere construction model so as to enable the scattered data to be uniformly distributed on the outer surface of the scattered data sphere.
Further, the data sphere construction model calculates the sphere center from the dispersed data, where C is the dispersed data, min(C) is the minimum value of the dispersed data, max(C) is the maximum value of the dispersed data, O_x is the x-axis coordinate of the calculated sphere center, and O_y is the y-axis coordinate of the calculated sphere center; the z-axis coordinate of the sphere center is uniformly set to 0. A dispersed data sphere is then constructed from the sphere center calculated by the model, using the computed data volume of the dispersed data as the radius.
Further, the method of performing data analysis on each dispersed data set in step 4 to obtain the data characteristics of all dispersed data comprises: normalizing and averaging the dispersed data to generate first characteristic dispersed data, where n groups of normalization layers perform the normalization and n-1 averaging layers perform the averaging; applying a first group of dilation normalization processes to the first characteristic dispersed data to generate second characteristic dispersed data, and averaging the first characteristic dispersed data to generate third characteristic dispersed data; splicing the third characteristic dispersed data with the second characteristic dispersed data to generate fourth characteristic dispersed data; and taking the fourth characteristic dispersed data as the data characteristic of the dispersed data.
Further, the third feature dispersion data has the same length as the second feature dispersion data.
Further, in generating the second characteristic dispersed data from the first characteristic dispersed data, the first group of dilation normalization processes comprises three dilation normalizations.
Further, the method of correcting the combined dispersed data in step 6 with a preset correction model to obtain the finally cleaned soil data comprises: calculating a correction value with a preset correction value model from the obtained ambient temperature, ambient humidity and ambient illumination intensity, and multiplying each datum in the combined dispersed data by the correction value to obtain the finally cleaned soil data.
Further, the correction value model is a function of λ, θ and ψ, where λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
The invention further provides a data cleaning device for soil big data analysis that implements the above method.
With the data cleaning method and device in soil big data analysis described above, instead of cleaning data merely by searching for abnormal values, normal data are marked by constructing a data cube, abnormal data are corrected, and a correction model is built that accounts for anomalies caused by environmental data in the soil data, significantly improving the accuracy of data cleaning. This is achieved mainly through the following steps. Amplification of data: the method distinguishes normal from abnormal data for the first time through data amplification; after amplification the abnormal data differ obviously from the normal data, and this distinction improves the efficiency of subsequent cleaning. Construction of a data cube: the constructed data cube allows a more three-dimensional analysis of the data and can directly lock onto abnormal data; the dispersed data sphere is fitted against the cleaning cube, and data that make contact are normal data while data that cannot make contact are abnormal data. This way of finding abnormal data is less efficient than the prior art but more accurate. Data correction: the soil data are corrected using the ambient conditions, temperature and humidity at the time of soil data acquisition, improving the validity and accuracy of the cleaned data obtained after correction.
Drawings
Fig. 1 is a schematic structural diagram of a system of a data cleaning method in soil big data analysis according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a data scattering principle of a data cleaning method and apparatus for soil big data analysis according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a principle that a dispersed data sphere of the data cleaning method and apparatus for soil big data analysis according to the embodiment of the present invention is turned inside a data cleaning cube.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Example 1
As shown in fig. 1, a data cleaning method in soil big data analysis, the method performs the following steps:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a scattered data sphere based on the data structure and the data size of each scattered data;
step 4: performing data analysis on each dispersed data to obtain data characteristics of all dispersed data, and respectively constructing data cleaning cubes of all dispersed data by taking the data characteristics of each dispersed data as a center and taking the data radius of the dispersed data as a side length;
step 5: placing the scattered data spheres inside the data cleaning cube, and turning the scattered data spheres inside the data cleaning cube, wherein in the turning process, the scattered data on the surfaces of the scattered data spheres are in contact with the data cleaning cube, and each contacted scattered data is recorded;
step 6: reserving the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, combining the noise-reduced dispersed data with the recorded dispersed data to obtain combined dispersed data, and correcting the combined dispersed data by using a preset correction model to obtain the finally cleaned soil data.
Referring to fig. 2, fig. 2 is a schematic diagram of the data dispersion principle of the present invention. After data dispersion, the original data are amplified; compared with the original data, the regularity of the amplified data is unchanged, but the difference between abnormal and normal data becomes obvious, so the amplified abnormal data can be found and the accuracy of data cleaning improved.
Referring to fig. 3, fig. 3 shows that after the dispersed data sphere obtained by the present invention enters the data cleaning cube, the dispersed data distributed on the sphere's surface come into contact with the cube. Data on the sphere that touch the cube correspond to cleanable data, i.e. normal data, while data that never touch the cube are abnormal data. Compared with the prior art, which performs outlier detection on the data directly, this method can find hidden abnormal data and improves the accuracy of data cleaning.
Example 2
On the basis of the previous embodiment, the ratio set in step 2 ranges from 3 to 8, and its value depends on the category of the classified data: when the classified data is the soil available water content, the set ratio is 3; for the sand content, 4; for the silt content, 5; for the clay content, 6; for the soil volume weight, 7; and for the organic carbon content, 8.
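The category-to-ratio mapping of this embodiment can be sketched as follows. The patent does not define the amplification operator, so treating "amplification by a set proportion" as plain multiplication, as well as the category key names, are assumptions made for illustration only.

```python
# Category-to-ratio mapping as listed in the embodiment. Treating
# amplification as multiplication by the set ratio is an assumption;
# the patent leaves the amplification operator undefined.
SET_RATIOS = {
    "available_water": 3,
    "sand": 4,
    "silt": 5,
    "clay": 6,
    "bulk_density": 7,
    "organic_carbon": 8,
}

def disperse(classified):
    """Amplify each classified data series by its category's set ratio."""
    return {
        category: [value * SET_RATIOS[category] for value in values]
        for category, values in classified.items()
    }
```

For example, `disperse({"sand": [1.0, 2.0]})` scales the sand series by its ratio of 4, yielding `{"sand": [4.0, 8.0]}`.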
Example 3
On the basis of the above embodiment, the method for constructing a scattered data sphere in step 3 specifically includes: and calculating the data volume of the dispersed data, taking the data volume of the dispersed data obtained by calculation as the radius of a dispersed data sphere, and constructing a dispersed data sphere by using a preset data sphere construction model so as to enable the dispersed data to be uniformly distributed on the outer surface of the dispersed data sphere.
Specifically, a consistency check examines whether the data are reasonable given each variable's valid value range and the relationships among variables, and finds data that are out of range, logically unreasonable, or contradictory. For example, for a variable measured on a 1-7 scale, a value of 0, or a negative body weight, should be regarded as outside the normal range. Software such as SPSS, SAS, and Excel can automatically identify out-of-range values for each variable according to its defined value range. Logically inconsistent answers can take many forms: a respondent may say he drives to work yet report owning no car, or may report being a heavy buyer and user of a certain brand while giving a very low score on a familiarity scale. When an inconsistency is found, the questionnaire serial number, record serial number, variable name, and error category are listed to facilitate further checking and correction.
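A consistency check of this kind can be sketched as a simple range filter; the variable names and valid ranges below are illustrative assumptions, not values from the patent.

```python
# Minimal range-based consistency check: flag values that fall outside
# a declared valid range for their variable. Ranges are illustrative.
VALID_RANGES = {
    "scale_1_to_7": (1, 7),     # questionnaire scale item
    "body_weight_kg": (0, 500), # weight cannot be negative
}

def out_of_range(variable, values):
    """Return the values that violate the variable's valid range."""
    lo, hi = VALID_RANGES[variable]
    return [v for v in values if not (lo <= v <= hi)]
```

Applied to a 1-7 scale item, `out_of_range("scale_1_to_7", [0, 3, 7, 9])` flags `[0, 9]`, matching the example above.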
Example 4
On the basis of the above embodiment, the data sphere construction model calculates the sphere center from the dispersed data, where C is the dispersed data, min(C) is the minimum value of the dispersed data, max(C) is the maximum value of the dispersed data, O_x is the x-axis coordinate of the calculated sphere center, and O_y is the y-axis coordinate of the calculated sphere center; the z-axis coordinate of the sphere center is uniformly set to 0. A dispersed data sphere is then constructed from the sphere center calculated by the model, using the computed data volume of the dispersed data as the radius.
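The sphere construction can be sketched as follows. The patent's center formula survives only as an image, so the expressions for O_x and O_y below are illustrative guesses (midpoint and half-range of the data), not the patented formula; the uniform distribution of data over the sphere surface is sketched with a Fibonacci lattice.

```python
import math

def build_data_sphere(c):
    """Place the dispersed data c uniformly on a sphere surface.

    O_x and O_y are ASSUMED forms (the patented formula is not
    reproduced in the source text); the data volume len(c) is used
    as the radius, as the embodiment describes.
    """
    ox = (min(c) + max(c)) / 2.0   # assumed O_x: midpoint of C
    oy = (max(c) - min(c)) / 2.0   # assumed O_y: half-range of C
    radius = float(len(c))         # data volume as sphere radius
    golden = math.pi * (3.0 - math.sqrt(5.0))  # Fibonacci lattice step
    points = []
    for i, value in enumerate(c):
        z = 1.0 - 2.0 * i / max(len(c) - 1, 1)    # unit z in [-1, 1]
        r = math.sqrt(max(0.0, 1.0 - z * z))      # ring radius at that z
        theta = golden * i
        points.append((ox + radius * r * math.cos(theta),
                       oy + radius * r * math.sin(theta),
                       radius * z,
                       value))
    return {"center": (ox, oy, 0.0), "radius": radius, "points": points}
```

Every generated point lies exactly one radius from the center, so the data are spread over the outer surface as the embodiment requires.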
Specifically, because of errors in investigation, coding, and entry, some invalid and missing values may be present in the data and require appropriate handling. The common treatments are: estimation, casewise deletion, variable deletion, and pairwise deletion.
Estimation. The simplest approach is to replace invalid and missing values with the sample mean, median, or mode of the variable. This approach is simple but does not make full use of the information already in the data, and the error may be large. Another approach is to estimate the missing value from the respondent's answers to other questions, through correlation analysis or logical inference between variables. For example, ownership of a product may be related to household income, so the likelihood of owning the product can be inferred from the respondent's household income.
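The mean or median estimation just described can be sketched with the standard library; the function name `impute` and its `how` parameter are introduced here for illustration.

```python
import statistics

def impute(values, how="mean"):
    """Replace missing values (None) with the mean or median of the
    observed values, the simplest estimation approach described above."""
    observed = [v for v in values if v is not None]
    if how == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]
```

The median variant is the more robust choice when the series already contains suspected outliers, since a single extreme value does not shift the fill value.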
Casewise deletion eliminates the samples that contain missing values. Since many questionnaires may have missing values, this approach can greatly reduce the effective sample size and fails to make full use of the data already collected. It is therefore suitable only when a key variable is missing, or when the samples containing invalid or missing values account for a small proportion of the total.
Variable deletion. If a variable has many invalid and missing values and is not particularly important to the problem under study, the variable may be deleted. This reduces the number of variables available for analysis but does not change the sample size.
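The contrast between casewise and variable deletion can be sketched on records represented as dictionaries; the function names are introduced here for illustration.

```python
def casewise_delete(records):
    """Drop every record that contains a missing value (None),
    shrinking the sample size."""
    return [r for r in records if None not in r.values()]

def variable_delete(records, variable):
    """Drop one whole variable from every record instead,
    keeping the sample size intact."""
    return [{k: v for k, v in r.items() if k != variable} for r in records]
```

On the same two records, casewise deletion keeps one complete row, while deleting the offending variable keeps both rows with one fewer column, which illustrates the sample-size trade-off described above.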
Example 5
On the basis of the above embodiment, the method of performing data analysis on each dispersed data set in step 4 to obtain the data characteristics of all dispersed data comprises: normalizing and averaging the dispersed data to generate first characteristic dispersed data, where n groups of normalization layers perform the normalization and n-1 averaging layers perform the averaging; applying a first group of dilation normalization processes to the first characteristic dispersed data to generate second characteristic dispersed data, and averaging the first characteristic dispersed data to generate third characteristic dispersed data; splicing the third characteristic dispersed data with the second characteristic dispersed data to generate fourth characteristic dispersed data; and taking the fourth characteristic dispersed data as the data characteristic of the dispersed data.
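The feature pipeline above leaves its operators undefined, so the sketch below rests on heavy assumptions: "normalization" is taken as min-max scaling, "averaging" as a two-point moving average, and "dilation normalization" as min-max scaling followed by a clipped dilation factor of 3. None of these choices come from the patent text.

```python
def minmax(xs):
    """Min-max normalization to [0, 1] (assumed normalization operator)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def moving_avg(xs, k=2):
    """Two-point moving average (assumed averaging operator)."""
    return [sum(xs[i:i + k]) / k for i in range(len(xs) - k + 1)]

def dilate_norm(xs, factor=3):
    """Assumed dilation normalization: normalize, stretch, clip to 1."""
    return [min(1.0, x * factor) for x in minmax(xs)]

def features(dispersed):
    first = moving_avg(minmax(dispersed))       # first characteristic data
    second = dilate_norm(first)                 # second characteristic data
    third = moving_avg(first + [first[-1]])     # third; padded so its
                                                # length matches second
    return third + second                       # fourth: spliced feature
```

The padding in `third` enforces the stated constraint that the third and second characteristic data have the same length before splicing.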
Example 6
On the basis of the above embodiment, the third feature dispersion data and the second feature dispersion data have the same length.
Abnormal data generally falls into two categories:
1) avoidable dirty data
Dirty data of this type, as the name implies, can be made valid directly by simple processing or avoided by manual correction.
Such dirty data is quite common in everyday life, e.g., errors due to naming irregularities, spelling errors, entry errors, nulls, etc.
2) Unavoidable dirty data
Unavoidable dirty data mainly takes the form of outliers, duplicate values, and null values, and such dirty data requires dedicated processing.
The 3σ rule is a common means of detecting abnormal values: assuming a group of measurements contains only random error, the standard deviation of the measurements is calculated and an interval is determined at a given probability; any error beyond that interval is considered a gross error rather than a random one, and the data containing it are removed. The interval is usually three standard deviations on either side of the mean, hence the name 3σ rule.
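The 3σ test just described can be sketched directly with the standard library:

```python
import statistics

def three_sigma_clean(data):
    """Remove values lying more than three standard deviations from the
    mean, treating them as gross errors per the 3-sigma rule."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    return [x for x in data if abs(x - mu) <= 3 * sigma]
```

A single extreme value among many typical readings is removed, while the typical readings all survive the three-standard-deviation band.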
Example 7
On the basis of the above embodiment, in generating the second characteristic dispersed data from the first characteristic dispersed data, the first group of dilation normalization processes comprises three dilation normalizations.
Example 8
On the basis of the previous embodiment, the method of correcting the combined dispersed data in step 6 with a preset correction model to obtain the finally cleaned soil data comprises: calculating a correction value with a preset correction value model from the obtained ambient temperature, ambient humidity and ambient illumination intensity, and multiplying each datum in the combined dispersed data by the correction value to obtain the finally cleaned soil data.
Specifically, data cleansing is the process of rechecking and validating data, intended to remove duplicate information, correct existing errors, and provide data consistency.
Example 9
On the basis of the above embodiment, the correction value model is a function of λ, θ and ψ, where λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
Specifically, because soil data are easily influenced by environmental parameters during collection, they must be corrected using the environmental data, and the corrected data markedly improve accuracy.
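How a single scalar correction value would be applied is sketched below. The correction-value formula itself is given only as an image in the source, so `correction_value` uses a hypothetical form purely to show the multiplicative application; it is not the patented model.

```python
def correction_value(temp, humidity, light):
    """Hypothetical stand-in (NOT the patented formula): a scalar
    derived from ambient temperature (lambda), humidity (theta) and
    illumination intensity (psi)."""
    return 1.0 / (1.0 + 0.01 * temp + 0.005 * humidity + 0.0001 * light)

def correct(data, temp, humidity, light):
    """Multiply each combined dispersed datum by the correction value,
    as step 6 describes."""
    k = correction_value(temp, humidity, light)
    return [x * k for x in data]
```

Under neutral conditions (all environmental terms zero) the stand-in yields a correction of 1.0 and leaves the data unchanged, which is the sanity check one would want of any such model.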
Example 10
A data cleaning device for soil big data analysis, which performs the method of the above embodiments.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of the present disclosure.
The present examples are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein. For example, various elements or components may be combined or combined in another system or certain features may be omitted, or not implemented.
Furthermore, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may also be indirectly coupled or communicating through some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.
Claims (10)
1. A method for data cleaning in soil big data analysis, characterized in that the method comprises the following steps:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: performing data analysis on each dispersed data to obtain the data characteristics of all the dispersed data, and constructing a data cleaning cube for each dispersed data, with the data characteristics of that dispersed data as the center and the data radius of the dispersed data as the side length;
step 5: placing the dispersed data sphere inside the data cleaning cube and rolling the sphere within the cube; during the rolling, the dispersed data on the surface of the sphere come into contact with the data cleaning cube, and each contacted dispersed data is recorded;
step 6: retaining the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, combining the noise-reduced dispersed data with the recorded dispersed data to obtain combined dispersed data, and correcting the combined dispersed data with a preset correction model to obtain the final cleaned soil data.
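The geometric language of steps 4 to 6 (rolling a data sphere inside a cleaning cube) does not pin down a concrete algorithm. A minimal sketch of one plausible reading is given below: a surface datum "contacts" the cube when its deviation from the data characteristic (the cube center) reaches the cube's half side. The function name, the contact test, and the shrink-toward-center noise reduction are all assumptions, not the patent's disclosed method.

```python
def clean_step(dispersed, feature_center, half_side):
    """One plausible reading of steps 5-6: a surface datum "contacts" the
    cleaning cube when its deviation from the data characteristic (the cube
    center) reaches the cube's half side; contacted data are kept verbatim,
    the rest are noise-reduced. The contact test and the shrink-toward-center
    noise reduction are assumptions, not the patent's disclosed method."""
    contacted, uncontacted = [], []
    for x in dispersed:
        (contacted if abs(x - feature_center) >= half_side else uncontacted).append(x)
    # Simple noise reduction for unrecorded data: pull halfway toward the center.
    denoised = [feature_center + 0.5 * (x - feature_center) for x in uncontacted]
    return contacted + denoised  # combined dispersed data, ready for correction
```

The retained (contacted) data pass through unchanged, matching step 6's "reserving the recorded dispersed data"; only the unrecorded remainder is smoothed before the two sets are merged.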
2. The method of claim 1, wherein the set ratio in step 2 takes a value in the range 3 to 8, the value depending on the category of the classified data: when the classified data is the soil available water content, the set ratio is 3; when the classified data is the sand content, the set ratio is 4; when the classified data is the silt content, the set ratio is 5; when the classified data is the clay content, the set ratio is 6; when the classified data is the soil volume weight, the set ratio is 7; and when the classified data is the organic carbon content, the set ratio is 8.
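The six category-to-ratio assignments fixed by this claim can be tabulated directly. In the sketch below, the dictionary keys and the `amplify` helper are assumptions, as is reading "amplifying by a set ratio" as element-wise scaling; only the six ratio values come from the claim.

```python
# Per-category amplification ratios of claim 2 (range 3-8).
# The key names are illustrative; the patent only names the categories.
AMPLIFICATION_RATIO = {
    "available_water_content": 3,
    "sand_content": 4,
    "silt_content": 5,
    "clay_content": 6,
    "soil_volume_weight": 7,
    "organic_carbon_content": 8,
}

def amplify(category, values):
    """Amplify one category of classified data by its set ratio,
    read here as element-wise scaling (an assumption)."""
    ratio = AMPLIFICATION_RATIO[category]
    return [v * ratio for v in values]
```

For example, `amplify("sand_content", [1.0, 2.0])` scales each sand-content value by the set ratio 4.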
3. The method of claim 2, wherein constructing the dispersed data sphere in step 3 specifically comprises: calculating the data volume of the dispersed data, taking the calculated data volume as the radius of the dispersed data sphere, and constructing the dispersed data sphere with a preset data-sphere construction model, so that the dispersed data are uniformly distributed over the outer surface of the dispersed data sphere.
4. The method of claim 3, wherein the data-sphere construction model is expressed by a formula (given only as an image in the source) in which C is the dispersed data; min(C) is the minimum value of the dispersed data; max(C) is the maximum value of the dispersed data; O_x is the x-axis coordinate of the calculated sphere center; O_y is the y-axis coordinate of the calculated sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. The dispersed data sphere is constructed from the sphere center calculated by the model, with the calculated data volume of the dispersed data as its radius.
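Since the construction formula itself is not reproduced in the text, a sketch can only show the shape of the computation the claim describes: data volume as radius, a center derived from min(C) and max(C) with the z-coordinate fixed at 0, and a near-uniform distribution of the data over the sphere surface. The midpoint center and the Fibonacci-lattice placement below are assumptions.

```python
import math

def build_data_sphere(data):
    """Sketch of the dispersed-data sphere of claims 3-4.

    The patent gives the center formula only as an image; the midpoint of
    min(C) and max(C) used for O_x and O_y here is an assumption consistent
    with the inputs the claim lists. The z-coordinate is fixed at 0 and the
    data volume (here, the element count) is the radius, per the claims.
    """
    radius = float(len(data))                 # data volume as radius (claim 3)
    ox = oy = (min(data) + max(data)) / 2.0   # assumed center from min(C), max(C)
    center = (ox, oy, 0.0)                    # z-axis coordinate set to 0 (claim 4)

    # Distribute the data near-uniformly on the sphere surface using a
    # Fibonacci lattice (a standard construction; not from the patent).
    golden = math.pi * (3.0 - math.sqrt(5.0))
    n = len(data)
    placed = []
    for i, value in enumerate(data):
        z = 1.0 - 2.0 * (i + 0.5) / n         # latitudes spread over (-1, 1)
        r_xy = math.sqrt(1.0 - z * z)
        theta = golden * i
        placed.append((value, (center[0] + radius * r_xy * math.cos(theta),
                               center[1] + radius * r_xy * math.sin(theta),
                               center[2] + radius * z)))
    return center, radius, placed
```

Every placed point lies exactly at distance `radius` from the computed center, which is the property the uniform-surface wording of claim 3 requires.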
5. The method of claim 4, wherein performing data analysis on each dispersed data in step 4 to obtain the data characteristics of all the dispersed data comprises: normalizing and averaging the dispersed data to generate first-characteristic dispersed data, wherein n normalization layers are used for the normalization and n-1 averaging layers are used for the averaging; performing a first group of dilation-normalization processes on the first-characteristic dispersed data to generate second-characteristic dispersed data, and performing an averaging process on the first-characteristic dispersed data to generate third-characteristic dispersed data; concatenating the third-characteristic dispersed data with the second-characteristic dispersed data to generate fourth-characteristic dispersed data; and taking the fourth-characteristic dispersed data as the data characteristic of the dispersed data.
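One way the four characteristic data of this claim could fit together is sketched below. The concrete operators are all assumptions: min-max normalization, a length-preserving moving average (so that the third and second characteristics match in length, as claim 6 requires), and "dilation normalization" read as scale-then-renormalize, repeated three times per claim 7.

```python
def minmax_normalize(xs):
    """Min-max normalization to [0, 1]."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0          # guard against constant input
    return [(x - lo) / span for x in xs]

def moving_average(xs):
    """Length-preserving moving average over each point and its neighbors."""
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - 1), min(len(xs), i + 2)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

def data_features(dispersed, dilations=3):
    # First-characteristic data: normalize, then average (claim 5).
    first = moving_average(minmax_normalize(dispersed))
    # Second-characteristic data: "dilation normalization", read here as
    # scale-up followed by renormalization, three rounds per claim 7.
    second = first
    for _ in range(dilations):
        second = minmax_normalize([2.0 * x for x in second])
    # Third-characteristic data: a further (length-preserving) averaging of
    # the first characteristic, so its length matches the second (claim 6).
    third = moving_average(first)
    # Fourth-characteristic data: concatenation of third and second (claim 5).
    return third + second
```

The output has twice the input length because the third and second characteristics, each as long as the input, are concatenated.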
6. The method of claim 5, wherein the third-characteristic dispersed data has the same length as the second-characteristic dispersed data.
7. The method of claim 6, wherein, in performing the first group of dilation-normalization processes on the first-characteristic dispersed data to generate the second-characteristic dispersed data, the number of dilation-normalization processes in the first group is three.
8. The method of claim 7, wherein using the preset correction model in step 6 to correct the combined dispersed data to obtain the final cleaned soil data comprises: calculating a correction value with a preset correction-value model from the acquired environmental temperature, environmental humidity, and environmental illumination intensity, and multiplying each data item in the combined dispersed data by the correction value to obtain the final cleaned soil data.
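The claim discloses only the inputs of the correction-value model (environmental temperature, humidity, and illumination intensity) and that each combined datum is multiplied by the result. A hypothetical model with assumed reference conditions and a product-of-ratios form might look like:

```python
def correction_value(temp_c, humidity, lux,
                     ref_temp=20.0, ref_humidity=0.5, ref_lux=10000.0):
    """Hypothetical correction-value model. The patent discloses only the
    inputs (temperature, humidity, illumination intensity), so the reference
    conditions and the product-of-ratios form are assumptions."""
    return (ref_temp / temp_c) * (ref_humidity / humidity) * (ref_lux / lux)

def correct(data, temp_c, humidity, lux):
    """Multiply each combined dispersed datum by the correction value (claim 8)."""
    k = correction_value(temp_c, humidity, lux)
    return [x * k for x in data]
```

At the reference conditions the correction value is 1 and the data pass through unchanged; deviations from the references scale every datum by the same factor, which is all the multiplication step of claim 8 requires.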
10. A data cleaning apparatus for use in soil big data analysis, configured to carry out the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210067946.4A CN114443635B (en) | 2022-01-20 | 2022-01-20 | Data cleaning method and device in soil big data analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114443635A true CN114443635A (en) | 2022-05-06 |
CN114443635B CN114443635B (en) | 2024-04-09 |
Family
ID=81368576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210067946.4A Active CN114443635B (en) | 2022-01-20 | 2022-01-20 | Data cleaning method and device in soil big data analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114443635B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004053659A2 (en) * | 2002-12-10 | 2004-06-24 | Stone Investments, Inc | Method and system for analyzing data and creating predictive models |
CN107741990A (en) * | 2017-11-01 | 2018-02-27 | 深圳汇生通科技股份有限公司 | Data cleansing integration method and system |
CN109739850A (en) * | 2019-01-11 | 2019-05-10 | 安徽爱吉泰克科技有限公司 | A kind of archives big data intellectual analysis cleaning digging system |
CN112733417A (en) * | 2020-11-16 | 2021-04-30 | 南京邮电大学 | Abnormal load data detection and correction method and system based on model optimization |
WO2021139249A1 (en) * | 2020-05-28 | 2021-07-15 | 平安科技(深圳)有限公司 | Data anomaly detection method, apparatus and device, and storage medium |
Non-Patent Citations (2)
Title |
---|
梁卫宁; 周钰书; 唐文彬; 刘森; 陈玲娜: "Research and Application of Marketing Data Cleaning and Governance Methods" (营销数据清洗及治理方法的研究及应用), Information Technology and Informatization (信息技术与信息化), no. 07, 28 July 2020 (2020-07-28) *
毛云鹏; 龙虎; 邓韧; 郭欣: "Application of Data Cleaning in Medical Big Data Analysis" (数据清洗在医疗大数据分析中的应用), China Digital Medicine (中国数字医学), no. 06, 15 June 2017 (2017-06-15) *
Also Published As
Publication number | Publication date |
---|---|
CN114443635B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lee et al. | Intelliclean: a knowledge-based intelligent data cleaner | |
Wen et al. | MVS-GCN: A prior brain structure learning-guided multi-view graph convolution network for autism spectrum disorder diagnosis | |
CN107633265B (en) | Data processing method and device for optimizing credit evaluation model | |
Low et al. | A knowledge-based approach for duplicate elimination in data cleaning | |
Bock et al. | Analysis of symbolic data: exploratory methods for extracting statistical information from complex data | |
CN107741990B (en) | Data cleaning integration method and system | |
CN114864099B (en) | Clinical data automatic generation method and system based on causal relationship mining | |
CN112597238A (en) | Method, system, device and medium for establishing knowledge graph based on personnel information | |
Al-Rasheed | Identification of important features and data mining classification techniques in predicting employee absenteeism at work. | |
Ebden et al. | Network analysis on provenance graphs from a crowdsourcing application | |
CN114861719A (en) | High-speed train bearing fault diagnosis method based on ensemble learning | |
CN114358611A (en) | Subject development-based data acquisition system for scientific research capability assessment | |
CN114443635B (en) | Data cleaning method and device in soil big data analysis | |
CN111915368B (en) | System, method and medium for identifying customer ID in automobile industry | |
Khoshgoftaar et al. | Detecting noisy instances with the rule-based classification model | |
CN112506907A (en) | Engineering machinery marketing strategy pushing method, system and device based on big data | |
KR101985961B1 (en) | Similarity Quantification System of National Research and Development Program and Searching Cooperative Program using same | |
Pahwa et al. | An efficient algorithm for data cleaning | |
CN116152018A (en) | High and new technology enterprise patent intellectual property project feasibility pre-evaluation system | |
CN114049966B (en) | Food-borne disease outbreak identification method and system based on link prediction | |
CN116028847A (en) | Universal method and system for automatic intelligent diagnosis of turbine mechanical faults | |
CN112559499A (en) | Data mining system and method | |
KR100686466B1 (en) | System and method for valuing loan portfolios using fuzzy clustering | |
Wang et al. | Decision tree classification algorithm for non-equilibrium data set based on random forests | |
CN114596152A (en) | Method, device and storage medium for predicting debt subject default based on unsupervised model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||