CN114443635A - Data cleaning method and device in soil big data analysis - Google Patents
Data cleaning method and device in soil big data analysis
- Publication number
- CN114443635A (application CN202210067946.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- dispersed
- soil
- scattered
- sphere
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/24—Earth materials
- G01N33/246—Earth materials for water content
Abstract
The invention relates to the field of data analysis, and in particular to a data cleaning method and device for soil big data analysis. The method comprises the following steps: collecting soil data, and acquiring environmental data at the time the soil data is collected; dispersing the collected soil data by category to obtain a plurality of dispersed data sets; constructing a dispersed data sphere based on the data structure and data size of each dispersed data set; and finally constructing a data cleaning cube and integrating it with the dispersed data spheres to obtain the final cleaned data. Unlike the prior art, which cleans data only by searching for abnormal values, the method and device mark normal data by constructing a data cube, further correct the abnormal data, and build a correction model that accounts for anomalies in the soil data caused by environmental conditions, thereby significantly improving the accuracy of data cleaning.
Description
Technical Field
The invention belongs to the field of data analysis, and particularly relates to a data cleaning method and device in soil big data analysis.
Background
Data cleansing (Data cleansing) refers to a process of re-examining and verifying Data with the purpose of deleting duplicate information, correcting existing errors, and providing Data consistency.
As its name suggests, data cleansing "washes out" the dirt: it is the last procedure for finding and correcting recognizable errors in a data file, including checking data consistency and handling invalid and missing values. Because the data in a data warehouse is a subject-oriented collection extracted from multiple business systems and containing historical data, it is unavoidable that some records are erroneous and some conflict with one another. Such erroneous or conflicting records are clearly unwanted and are called "dirty data". Dirty data must be "washed out" according to defined rules, and this washing is data cleansing. The task of data cleansing is to filter out the data that do not meet requirements, hand the filtered records to the competent business department, and let that department confirm whether they should be discarded or corrected before re-extraction. Unsatisfactory data fall into three major categories: incomplete data, erroneous data, and duplicate data. Data cleansing differs from questionnaire review: cleansing after entry is generally completed by computer rather than by hand.
Patent document CN201510947469.0 discloses a decision-tree-based method for analyzing attitude and orbit control data. The method includes preprocessing of the attitude and orbit control data, through which telemetry data deduplication, sorting, extraction, and outlier elimination are completed; hierarchical modeling of the attitude and orbit control system, establishing its information and control flow charts and determining the telemetry variables related to the system's current fault as input variables for decision-tree analysis; establishing a flow chart for the decision-tree analysis; and building a decision tree C5.0 algorithm model, in which the model name, the number of Boosting tests, the pruning attributes, and the minimum number of records per sub-branch are defined.
That patent describes a related data cleaning scheme, but the cleaning itself still relies on existing conventional techniques: the cleaned data may still contain abnormal values, which reduces the accuracy of subsequent data analysis.
Disclosure of Invention
The main object of the present invention is to provide a data cleaning method and device for soil big data analysis. Unlike the prior art, which cleans data only by searching for abnormal values, the method and device mark normal data by constructing a data cube, further correct the abnormal data, and build a correction model that accounts for anomalies in the soil data caused by environmental conditions, thereby significantly improving the accuracy of data cleaning.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
a method of data cleansing in soil big data analysis, the method performing the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a scattered data sphere based on the data structure and the data size of each scattered data;
step 4: performing data analysis on each dispersed data to obtain data characteristics of all dispersed data, and respectively constructing data cleaning cubes of all dispersed data by taking the data characteristics of each dispersed data as a center and taking the data radius of the dispersed data as a side length;
step 5: placing the scattered data spheres inside the data cleaning cube, and turning the scattered data spheres inside the data cleaning cube, wherein in the turning process, the scattered data on the surfaces of the scattered data spheres are in contact with the data cleaning cube, and each contacted scattered data is recorded;
step 6: reserving the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, combining the noise-reduced dispersed data with the recorded dispersed data to obtain combined dispersed data, and correcting the combined dispersed data by using a preset correction model to obtain the finally cleaned soil data.
Further, the ratio set in step 2 ranges from 3 to 8, and its value depends on the category of the classified data: when the classified data is the soil available water content, the set ratio is 3; for the sand content, 4; for the silt content, 5; for the clay content, 6; for the soil volume weight, 7; and for the organic carbon content, 8.
Further, the method for constructing the scattered data sphere in step 3 specifically includes: calculating the data volume of the scattered data, taking the calculated data volume of the scattered data as the radius of a scattered data sphere, and constructing a scattered data sphere by using a preset data sphere construction model so as to enable the scattered data to be uniformly distributed on the outer surface of the scattered data sphere.
Further, the data sphere construction model calculates the sphere center from the dispersed data, where C is the dispersed data, min(C) is the minimum value of the dispersed data, max(C) is the maximum value of the dispersed data, O_x is the x-axis coordinate of the calculated sphere center, and O_y is the y-axis coordinate of the calculated sphere center; the z-axis coordinate of the sphere center is uniformly set to 0. A dispersed data sphere is then constructed from the sphere center calculated by the model, using the computed data volume of the dispersed data as the radius.
Further, the method of performing data analysis on each dispersed data set in step 4 to obtain the data characteristics of all dispersed data comprises: normalizing and averaging the dispersed data to generate first characteristic dispersed data, where n groups of normalization layers perform the normalization and n-1 averaging layers perform the averaging; applying a first group of dilation normalization processes to the first characteristic dispersed data to generate second characteristic dispersed data, and averaging the first characteristic dispersed data to generate third characteristic dispersed data; splicing the third characteristic dispersed data with the second characteristic dispersed data to generate fourth characteristic dispersed data; and taking the fourth characteristic dispersed data as the data characteristic of the dispersed data.
Further, the third feature dispersion data has the same length as the second feature dispersion data.
Further, in generating the second characteristic dispersed data from the first characteristic dispersed data, the first group of dilation normalization processes comprises three dilation normalizations.
Further, the method of correcting the combined dispersed data in step 6 with a preset correction model to obtain the finally cleaned soil data comprises: calculating a correction value with a preset correction value model from the obtained ambient temperature, ambient humidity and ambient illumination intensity, and multiplying each datum in the combined dispersed data by the correction value to obtain the finally cleaned soil data.
Further, the correction value model is a function of λ, θ and ψ, where λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
The invention further provides a data cleaning device for soil big data analysis that implements the above method.
With the data cleaning method and device in soil big data analysis described above, instead of cleaning data merely by searching for abnormal values, normal data are marked by constructing a data cube, abnormal data are corrected, and a correction model is built that accounts for anomalies caused by environmental data in the soil data, significantly improving the accuracy of data cleaning. This is achieved mainly through the following steps. Amplification of data: the method distinguishes normal from abnormal data for the first time through data amplification; after amplification the abnormal data differ obviously from the normal data, and this distinction improves the efficiency of subsequent cleaning. Construction of a data cube: the constructed data cube allows a more three-dimensional analysis of the data and can directly lock onto abnormal data; the dispersed data sphere is fitted against the cleaning cube, and data that make contact are normal data while data that cannot make contact are abnormal data. This way of finding abnormal data is less efficient than the prior art but more accurate. Data correction: the soil data are corrected using the ambient conditions, temperature and humidity at the time of soil data acquisition, improving the validity and accuracy of the cleaned data obtained after correction.
Drawings
Fig. 1 is a schematic structural diagram of a system of a data cleaning method in soil big data analysis according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a data scattering principle of a data cleaning method and apparatus for soil big data analysis according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a principle that a dispersed data sphere of the data cleaning method and apparatus for soil big data analysis according to the embodiment of the present invention is turned inside a data cleaning cube.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Example 1
As shown in fig. 1, a data cleaning method in soil big data analysis, the method performs the following steps:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a scattered data sphere based on the data structure and the data size of each scattered data;
step 4: performing data analysis on each dispersed data to obtain data characteristics of all dispersed data, and respectively constructing data cleaning cubes of all dispersed data by taking the data characteristics of each dispersed data as a center and taking the data radius of the dispersed data as a side length;
step 5: placing the scattered data spheres inside the data cleaning cube, and turning the scattered data spheres inside the data cleaning cube, wherein in the turning process, the scattered data on the surfaces of the scattered data spheres are in contact with the data cleaning cube, and each contacted scattered data is recorded;
step 6: reserving the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, combining the noise-reduced dispersed data with the recorded dispersed data to obtain combined dispersed data, and correcting the combined dispersed data by using a preset correction model to obtain the finally cleaned soil data.
Referring to fig. 2, fig. 2 is a schematic diagram of the data dispersion principle of the present invention. After data dispersion, the original data are amplified; compared with the original data, the regularity of the amplified data is unchanged, but the difference between abnormal and normal data becomes obvious, so the amplified abnormal data can be found and the accuracy of data cleaning improved.
Referring to fig. 3, fig. 3 shows that after the dispersed data sphere obtained by the present invention enters the data cleaning cube, the dispersed data distributed on the sphere's surface come into contact with the cube. Data on the sphere that touch the cube correspond to cleanable data, i.e. normal data, while data that never touch the cube are abnormal data. Compared with the prior art, which performs outlier detection on the data directly, this method can find hidden abnormal data and improves the accuracy of data cleaning.
Example 2
On the basis of the previous embodiment, the ratio set in step 2 ranges from 3 to 8, and its value depends on the category of the classified data: when the classified data is the soil available water content, the set ratio is 3; for the sand content, 4; for the silt content, 5; for the clay content, 6; for the soil volume weight, 7; and for the organic carbon content, 8.
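The category-to-ratio mapping of this embodiment can be sketched as follows. The patent does not define the amplification operator, so treating "amplification by a set proportion" as plain multiplication, as well as the category key names, are assumptions made for illustration only.

```python
# Category-to-ratio mapping as listed in the embodiment. Treating
# amplification as multiplication by the set ratio is an assumption;
# the patent leaves the amplification operator undefined.
SET_RATIOS = {
    "available_water": 3,
    "sand": 4,
    "silt": 5,
    "clay": 6,
    "bulk_density": 7,
    "organic_carbon": 8,
}

def disperse(classified):
    """Amplify each classified data series by its category's set ratio."""
    return {
        category: [value * SET_RATIOS[category] for value in values]
        for category, values in classified.items()
    }
```

For example, `disperse({"sand": [1.0, 2.0]})` scales the sand series by its ratio of 4, yielding `{"sand": [4.0, 8.0]}`.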
Example 3
On the basis of the above embodiment, the method for constructing a scattered data sphere in step 3 specifically includes: and calculating the data volume of the dispersed data, taking the data volume of the dispersed data obtained by calculation as the radius of a dispersed data sphere, and constructing a dispersed data sphere by using a preset data sphere construction model so as to enable the dispersed data to be uniformly distributed on the outer surface of the dispersed data sphere.
Specifically, a consistency check examines whether the data are reasonable given each variable's valid value range and the relationships among variables, and finds data that are out of range, logically unreasonable, or contradictory. For example, for a variable measured on a 1-7 scale, a value of 0, or a negative body weight, should be regarded as outside the normal range. Software such as SPSS, SAS, and Excel can automatically identify out-of-range values for each variable according to its defined value range. Logically inconsistent answers can take many forms: a respondent may say he drives to work yet report owning no car, or may report being a heavy buyer and user of a certain brand while giving a very low score on a familiarity scale. When an inconsistency is found, the questionnaire serial number, record serial number, variable name, and error category are listed to facilitate further checking and correction.
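A consistency check of this kind can be sketched as a simple range filter; the variable names and valid ranges below are illustrative assumptions, not values from the patent.

```python
# Minimal range-based consistency check: flag values that fall outside
# a declared valid range for their variable. Ranges are illustrative.
VALID_RANGES = {
    "scale_1_to_7": (1, 7),     # questionnaire scale item
    "body_weight_kg": (0, 500), # weight cannot be negative
}

def out_of_range(variable, values):
    """Return the values that violate the variable's valid range."""
    lo, hi = VALID_RANGES[variable]
    return [v for v in values if not (lo <= v <= hi)]
```

Applied to a 1-7 scale item, `out_of_range("scale_1_to_7", [0, 3, 7, 9])` flags `[0, 9]`, matching the example above.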
Example 4
On the basis of the above embodiment, the data sphere construction model calculates the sphere center from the dispersed data, where C is the dispersed data, min(C) is the minimum value of the dispersed data, max(C) is the maximum value of the dispersed data, O_x is the x-axis coordinate of the calculated sphere center, and O_y is the y-axis coordinate of the calculated sphere center; the z-axis coordinate of the sphere center is uniformly set to 0. A dispersed data sphere is then constructed from the sphere center calculated by the model, using the computed data volume of the dispersed data as the radius.
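The sphere construction can be sketched as follows. The patent's center formula survives only as an image, so the expressions for O_x and O_y below are illustrative guesses (midpoint and half-range of the data), not the patented formula; the uniform distribution of data over the sphere surface is sketched with a Fibonacci lattice.

```python
import math

def build_data_sphere(c):
    """Place the dispersed data c uniformly on a sphere surface.

    O_x and O_y are ASSUMED forms (the patented formula is not
    reproduced in the source text); the data volume len(c) is used
    as the radius, as the embodiment describes.
    """
    ox = (min(c) + max(c)) / 2.0   # assumed O_x: midpoint of C
    oy = (max(c) - min(c)) / 2.0   # assumed O_y: half-range of C
    radius = float(len(c))         # data volume as sphere radius
    golden = math.pi * (3.0 - math.sqrt(5.0))  # Fibonacci lattice step
    points = []
    for i, value in enumerate(c):
        z = 1.0 - 2.0 * i / max(len(c) - 1, 1)    # unit z in [-1, 1]
        r = math.sqrt(max(0.0, 1.0 - z * z))      # ring radius at that z
        theta = golden * i
        points.append((ox + radius * r * math.cos(theta),
                       oy + radius * r * math.sin(theta),
                       radius * z,
                       value))
    return {"center": (ox, oy, 0.0), "radius": radius, "points": points}
```

Every generated point lies exactly one radius from the center, so the data are spread over the outer surface as the embodiment requires.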
Specifically, because of errors in investigation, coding, and entry, some invalid and missing values may be present in the data and require appropriate handling. The common treatments are: estimation, casewise deletion, variable deletion, and pairwise deletion.
Estimation. The simplest approach is to replace invalid and missing values with the sample mean, median, or mode of the variable. This approach is simple but does not make full use of the information already in the data, and the error may be large. Another approach is to estimate the missing value from the respondent's answers to other questions, through correlation analysis or logical inference between variables. For example, ownership of a product may be related to household income, so the likelihood of owning the product can be inferred from the respondent's household income.
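The mean or median estimation just described can be sketched with the standard library; the function name `impute` and its `how` parameter are introduced here for illustration.

```python
import statistics

def impute(values, how="mean"):
    """Replace missing values (None) with the mean or median of the
    observed values, the simplest estimation approach described above."""
    observed = [v for v in values if v is not None]
    if how == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]
```

The median variant is the more robust choice when the series already contains suspected outliers, since a single extreme value does not shift the fill value.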
Casewise deletion eliminates the samples that contain missing values. Since many questionnaires may have missing values, this approach can greatly reduce the effective sample size and fails to make full use of the data already collected. It is therefore suitable only when a key variable is missing, or when the samples containing invalid or missing values account for a small proportion of the total.
Variable deletion. If a variable has many invalid and missing values and is not particularly important to the problem under study, the variable may be deleted. This reduces the number of variables available for analysis but does not change the sample size.
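The contrast between casewise and variable deletion can be sketched on records represented as dictionaries; the function names are introduced here for illustration.

```python
def casewise_delete(records):
    """Drop every record that contains a missing value (None),
    shrinking the sample size."""
    return [r for r in records if None not in r.values()]

def variable_delete(records, variable):
    """Drop one whole variable from every record instead,
    keeping the sample size intact."""
    return [{k: v for k, v in r.items() if k != variable} for r in records]
```

On the same two records, casewise deletion keeps one complete row, while deleting the offending variable keeps both rows with one fewer column, which illustrates the sample-size trade-off described above.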
Example 5
On the basis of the above embodiment, the method of performing data analysis on each dispersed data set in step 4 to obtain the data characteristics of all dispersed data comprises: normalizing and averaging the dispersed data to generate first characteristic dispersed data, where n groups of normalization layers perform the normalization and n-1 averaging layers perform the averaging; applying a first group of dilation normalization processes to the first characteristic dispersed data to generate second characteristic dispersed data, and averaging the first characteristic dispersed data to generate third characteristic dispersed data; splicing the third characteristic dispersed data with the second characteristic dispersed data to generate fourth characteristic dispersed data; and taking the fourth characteristic dispersed data as the data characteristic of the dispersed data.
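The feature pipeline above leaves its operators undefined, so the sketch below rests on heavy assumptions: "normalization" is taken as min-max scaling, "averaging" as a two-point moving average, and "dilation normalization" as min-max scaling followed by a clipped dilation factor of 3. None of these choices come from the patent text.

```python
def minmax(xs):
    """Min-max normalization to [0, 1] (assumed normalization operator)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def moving_avg(xs, k=2):
    """Two-point moving average (assumed averaging operator)."""
    return [sum(xs[i:i + k]) / k for i in range(len(xs) - k + 1)]

def dilate_norm(xs, factor=3):
    """Assumed dilation normalization: normalize, stretch, clip to 1."""
    return [min(1.0, x * factor) for x in minmax(xs)]

def features(dispersed):
    first = moving_avg(minmax(dispersed))       # first characteristic data
    second = dilate_norm(first)                 # second characteristic data
    third = moving_avg(first + [first[-1]])     # third; padded so its
                                                # length matches second
    return third + second                       # fourth: spliced feature
```

The padding in `third` enforces the stated constraint that the third and second characteristic data have the same length before splicing.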
Example 6
On the basis of the above embodiment, the third feature dispersion data and the second feature dispersion data have the same length.
Abnormal data generally falls into two categories:
1) avoidable dirty data
Dirty data of this type, as the name implies, can be made valid directly by simple processing or avoided by manual correction.
Such dirty data is quite common in everyday life, e.g., errors due to naming irregularities, spelling errors, entry errors, nulls, etc.
2) Unavoidable dirty data
Unavoidable dirty data mainly takes the form of outliers, duplicate values, and null values, and such dirty data requires dedicated processing.
The 3σ rule is a common means of detecting abnormal values: assuming a group of measurements contains only random error, the standard deviation of the measurements is calculated and an interval is determined at a given probability; any error beyond that interval is considered a gross error rather than a random one, and the data containing it are removed. The interval is usually three standard deviations on either side of the mean, hence the name 3σ rule.
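The 3σ test just described can be sketched directly with the standard library:

```python
import statistics

def three_sigma_clean(data):
    """Remove values lying more than three standard deviations from the
    mean, treating them as gross errors per the 3-sigma rule."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    return [x for x in data if abs(x - mu) <= 3 * sigma]
```

A single extreme value among many typical readings is removed, while the typical readings all survive the three-standard-deviation band.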
Example 7
On the basis of the above embodiment, in generating the second characteristic dispersed data from the first characteristic dispersed data, the first group of dilation normalization processes comprises three dilation normalizations.
Example 8
On the basis of the previous embodiment, the method of correcting the combined dispersed data in step 6 with a preset correction model to obtain the finally cleaned soil data comprises: calculating a correction value with a preset correction value model from the obtained ambient temperature, ambient humidity and ambient illumination intensity, and multiplying each datum in the combined dispersed data by the correction value to obtain the finally cleaned soil data.
Specifically, data cleansing is the process of rechecking and validating data, intended to remove duplicate information, correct existing errors, and provide data consistency.
Example 9
On the basis of the above embodiment, the correction value model is a function of λ, θ and ψ, where λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
Specifically, because soil data are easily influenced by environmental parameters during collection, they must be corrected using the environmental data, and the corrected data markedly improve accuracy.
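How a single scalar correction value would be applied is sketched below. The correction-value formula itself is given only as an image in the source, so `correction_value` uses a hypothetical form purely to show the multiplicative application; it is not the patented model.

```python
def correction_value(temp, humidity, light):
    """Hypothetical stand-in (NOT the patented formula): a scalar
    derived from ambient temperature (lambda), humidity (theta) and
    illumination intensity (psi)."""
    return 1.0 / (1.0 + 0.01 * temp + 0.005 * humidity + 0.0001 * light)

def correct(data, temp, humidity, light):
    """Multiply each combined dispersed datum by the correction value,
    as step 6 describes."""
    k = correction_value(temp, humidity, light)
    return [x * k for x in data]
```

Under neutral conditions (all environmental terms zero) the stand-in yields a correction of 1.0 and leaves the data unchanged, which is the sanity check one would want of any such model.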
Example 10
A data cleaning device for soil big data analysis, which performs the method of the above embodiments.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of the present disclosure.
The present examples are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein. For example, various elements or components may be combined or combined in another system or certain features may be omitted, or not implemented.
Furthermore, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may also be indirectly coupled or communicating through some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.
Claims (10)
1. A method for data cleaning in soil big data analysis, characterized in that the method comprises the following steps:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: performing data analysis on each dispersed data to obtain the data characteristics of all the dispersed data, and constructing a data cleaning cube for each dispersed data, with the data characteristics of that dispersed data as the center and the data radius of the dispersed data as the side length;
step 5: placing the dispersed data sphere inside the data cleaning cube and rolling the sphere within the cube; during the rolling, the dispersed data on the surface of the sphere come into contact with the data cleaning cube, and each contacted dispersed data is recorded;
step 6: retaining the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, combining the noise-reduced dispersed data with the recorded dispersed data to obtain combined dispersed data, and correcting the combined dispersed data with a preset correction model to obtain the final cleaned soil data.
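The geometric language of steps 4 to 6 (rolling a data sphere inside a cleaning cube) does not pin down a concrete algorithm. A minimal sketch of one plausible reading is given below: a surface datum "contacts" the cube when its deviation from the data characteristic (the cube center) reaches the cube's half side. The function name, the contact test, and the shrink-toward-center noise reduction are all assumptions, not the patent's disclosed method.

```python
def clean_step(dispersed, feature_center, half_side):
    """One plausible reading of steps 5-6: a surface datum "contacts" the
    cleaning cube when its deviation from the data characteristic (the cube
    center) reaches the cube's half side; contacted data are kept verbatim,
    the rest are noise-reduced. The contact test and the shrink-toward-center
    noise reduction are assumptions, not the patent's disclosed method."""
    contacted, uncontacted = [], []
    for x in dispersed:
        (contacted if abs(x - feature_center) >= half_side else uncontacted).append(x)
    # Simple noise reduction for unrecorded data: pull halfway toward the center.
    denoised = [feature_center + 0.5 * (x - feature_center) for x in uncontacted]
    return contacted + denoised  # combined dispersed data, ready for correction
```

The retained (contacted) data pass through unchanged, matching step 6's "reserving the recorded dispersed data"; only the unrecorded remainder is smoothed before the two sets are merged.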
2. The method of claim 1, wherein the set ratio in step 2 takes a value in the range 3 to 8, the value depending on the category of the classified data: when the classified data is the soil available water content, the set ratio is 3; when the classified data is the sand content, the set ratio is 4; when the classified data is the silt content, the set ratio is 5; when the classified data is the clay content, the set ratio is 6; when the classified data is the soil volume weight, the set ratio is 7; and when the classified data is the organic carbon content, the set ratio is 8.
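The six category-to-ratio assignments fixed by this claim can be tabulated directly. In the sketch below, the dictionary keys and the `amplify` helper are assumptions, as is reading "amplifying by a set ratio" as element-wise scaling; only the six ratio values come from the claim.

```python
# Per-category amplification ratios of claim 2 (range 3-8).
# The key names are illustrative; the patent only names the categories.
AMPLIFICATION_RATIO = {
    "available_water_content": 3,
    "sand_content": 4,
    "silt_content": 5,
    "clay_content": 6,
    "soil_volume_weight": 7,
    "organic_carbon_content": 8,
}

def amplify(category, values):
    """Amplify one category of classified data by its set ratio,
    read here as element-wise scaling (an assumption)."""
    ratio = AMPLIFICATION_RATIO[category]
    return [v * ratio for v in values]
```

For example, `amplify("sand_content", [1.0, 2.0])` scales each sand-content value by the set ratio 4.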
3. The method of claim 2, wherein constructing the dispersed data sphere in step 3 specifically comprises: calculating the data volume of the dispersed data, taking the calculated data volume as the radius of the dispersed data sphere, and constructing the dispersed data sphere with a preset data-sphere construction model, so that the dispersed data are uniformly distributed over the outer surface of the dispersed data sphere.
4. The method of claim 3, wherein the data-sphere construction model is expressed by a formula (given only as an image in the source) in which C is the dispersed data; min(C) is the minimum value of the dispersed data; max(C) is the maximum value of the dispersed data; O_x is the x-axis coordinate of the calculated sphere center; O_y is the y-axis coordinate of the calculated sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. The dispersed data sphere is constructed from the sphere center calculated by the model, with the calculated data volume of the dispersed data as its radius.
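Since the construction formula itself is not reproduced in the text, a sketch can only show the shape of the computation the claim describes: data volume as radius, a center derived from min(C) and max(C) with the z-coordinate fixed at 0, and a near-uniform distribution of the data over the sphere surface. The midpoint center and the Fibonacci-lattice placement below are assumptions.

```python
import math

def build_data_sphere(data):
    """Sketch of the dispersed-data sphere of claims 3-4.

    The patent gives the center formula only as an image; the midpoint of
    min(C) and max(C) used for O_x and O_y here is an assumption consistent
    with the inputs the claim lists. The z-coordinate is fixed at 0 and the
    data volume (here, the element count) is the radius, per the claims.
    """
    radius = float(len(data))                 # data volume as radius (claim 3)
    ox = oy = (min(data) + max(data)) / 2.0   # assumed center from min(C), max(C)
    center = (ox, oy, 0.0)                    # z-axis coordinate set to 0 (claim 4)

    # Distribute the data near-uniformly on the sphere surface using a
    # Fibonacci lattice (a standard construction; not from the patent).
    golden = math.pi * (3.0 - math.sqrt(5.0))
    n = len(data)
    placed = []
    for i, value in enumerate(data):
        z = 1.0 - 2.0 * (i + 0.5) / n         # latitudes spread over (-1, 1)
        r_xy = math.sqrt(1.0 - z * z)
        theta = golden * i
        placed.append((value, (center[0] + radius * r_xy * math.cos(theta),
                               center[1] + radius * r_xy * math.sin(theta),
                               center[2] + radius * z)))
    return center, radius, placed
```

Every placed point lies exactly at distance `radius` from the computed center, which is the property the uniform-surface wording of claim 3 requires.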
5. The method of claim 4, wherein performing data analysis on each dispersed data in step 4 to obtain the data characteristics of all the dispersed data comprises: normalizing and averaging the dispersed data to generate first-characteristic dispersed data, wherein n normalization layers are used for the normalization and n-1 averaging layers are used for the averaging; performing a first group of dilation-normalization processes on the first-characteristic dispersed data to generate second-characteristic dispersed data, and performing an averaging process on the first-characteristic dispersed data to generate third-characteristic dispersed data; concatenating the third-characteristic dispersed data with the second-characteristic dispersed data to generate fourth-characteristic dispersed data; and taking the fourth-characteristic dispersed data as the data characteristic of the dispersed data.
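One way the four characteristic data of this claim could fit together is sketched below. The concrete operators are all assumptions: min-max normalization, a length-preserving moving average (so that the third and second characteristics match in length, as claim 6 requires), and "dilation normalization" read as scale-then-renormalize, repeated three times per claim 7.

```python
def minmax_normalize(xs):
    """Min-max normalization to [0, 1]."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0          # guard against constant input
    return [(x - lo) / span for x in xs]

def moving_average(xs):
    """Length-preserving moving average over each point and its neighbors."""
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - 1), min(len(xs), i + 2)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

def data_features(dispersed, dilations=3):
    # First-characteristic data: normalize, then average (claim 5).
    first = moving_average(minmax_normalize(dispersed))
    # Second-characteristic data: "dilation normalization", read here as
    # scale-up followed by renormalization, three rounds per claim 7.
    second = first
    for _ in range(dilations):
        second = minmax_normalize([2.0 * x for x in second])
    # Third-characteristic data: a further (length-preserving) averaging of
    # the first characteristic, so its length matches the second (claim 6).
    third = moving_average(first)
    # Fourth-characteristic data: concatenation of third and second (claim 5).
    return third + second
```

The output has twice the input length because the third and second characteristics, each as long as the input, are concatenated.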
6. The method of claim 5, wherein the third-characteristic dispersed data has the same length as the second-characteristic dispersed data.
7. The method of claim 6, wherein, in performing the first group of dilation-normalization processes on the first-characteristic dispersed data to generate the second-characteristic dispersed data, the number of dilation-normalization processes in the first group is three.
8. The method of claim 7, wherein using the preset correction model in step 6 to correct the combined dispersed data to obtain the final cleaned soil data comprises: calculating a correction value with a preset correction-value model from the acquired environmental temperature, environmental humidity, and environmental illumination intensity, and multiplying each data item in the combined dispersed data by the correction value to obtain the final cleaned soil data.
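The claim discloses only the inputs of the correction-value model (environmental temperature, humidity, and illumination intensity) and that each combined datum is multiplied by the result. A hypothetical model with assumed reference conditions and a product-of-ratios form might look like:

```python
def correction_value(temp_c, humidity, lux,
                     ref_temp=20.0, ref_humidity=0.5, ref_lux=10000.0):
    """Hypothetical correction-value model. The patent discloses only the
    inputs (temperature, humidity, illumination intensity), so the reference
    conditions and the product-of-ratios form are assumptions."""
    return (ref_temp / temp_c) * (ref_humidity / humidity) * (ref_lux / lux)

def correct(data, temp_c, humidity, lux):
    """Multiply each combined dispersed datum by the correction value (claim 8)."""
    k = correction_value(temp_c, humidity, lux)
    return [x * k for x in data]
```

At the reference conditions the correction value is 1 and the data pass through unchanged; deviations from the references scale every datum by the same factor, which is all the multiplication step of claim 8 requires.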
10. A data cleaning apparatus for use in soil big data analysis, configured to carry out the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210067946.4A CN114443635B (en) | 2022-01-20 | 2022-01-20 | Data cleaning method and device in soil big data analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114443635A true CN114443635A (en) | 2022-05-06 |
CN114443635B CN114443635B (en) | 2024-04-09 |
Family
ID=81368576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210067946.4A Active CN114443635B (en) | 2022-01-20 | 2022-01-20 | Data cleaning method and device in soil big data analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114443635B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004053659A2 (en) * | 2002-12-10 | 2004-06-24 | Stone Investments, Inc | Method and system for analyzing data and creating predictive models |
CN107741990A (en) * | 2017-11-01 | 2018-02-27 | 深圳汇生通科技股份有限公司 | Data cleansing integration method and system |
CN109739850A (en) * | 2019-01-11 | 2019-05-10 | 安徽爱吉泰克科技有限公司 | A kind of archives big data intellectual analysis cleaning digging system |
CN112733417A (en) * | 2020-11-16 | 2021-04-30 | 南京邮电大学 | Abnormal load data detection and correction method and system based on model optimization |
WO2021139249A1 (en) * | 2020-05-28 | 2021-07-15 | 平安科技(深圳)有限公司 | Data anomaly detection method, apparatus and device, and storage medium |
Non-Patent Citations (2)
Title |
---|
梁卫宁; 周钰书; 唐文彬; 刘森; 陈玲娜: "Research and Application of Marketing Data Cleaning and Governance Methods" (营销数据清洗及治理方法的研究及应用), Information Technology and Informatization (信息技术与信息化), no. 07, 28 July 2020 (2020-07-28) *
毛云鹏; 龙虎; 邓韧; 郭欣: "Application of Data Cleaning in Medical Big Data Analysis" (数据清洗在医疗大数据分析中的应用), China Digital Medicine (中国数字医学), no. 06, 15 June 2017 (2017-06-15) *
Also Published As
Publication number | Publication date |
---|---|
CN114443635B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lee et al. | Intelliclean: a knowledge-based intelligent data cleaner | |
Wen et al. | MVS-GCN: A prior brain structure learning-guided multi-view graph convolution network for autism spectrum disorder diagnosis | |
CN107633265B (en) | Data processing method and device for optimizing credit evaluation model | |
Low et al. | A knowledge-based approach for duplicate elimination in data cleaning | |
Bock et al. | Analysis of symbolic data: exploratory methods for extracting statistical information from complex data | |
CN107741990B (en) | Data cleaning integration method and system | |
CN114864099B (en) | Clinical data automatic generation method and system based on causal relationship mining | |
CN112597238A (en) | Method, system, device and medium for establishing knowledge graph based on personnel information | |
Al-Rasheed | Identification of important features and data mining classification techniques in predicting employee absenteeism at work. | |
Ebden et al. | Network analysis on provenance graphs from a crowdsourcing application | |
CN114861719A (en) | High-speed train bearing fault diagnosis method based on ensemble learning | |
CN114358611A (en) | Subject development-based data acquisition system for scientific research capability assessment | |
CN114443635B (en) | Data cleaning method and device in soil big data analysis | |
CN111915368B (en) | System, method and medium for identifying customer ID in automobile industry | |
Khoshgoftaar et al. | Detecting noisy instances with the rule-based classification model | |
CN112506907A (en) | Engineering machinery marketing strategy pushing method, system and device based on big data | |
KR101985961B1 (en) | Similarity Quantification System of National Research and Development Program and Searching Cooperative Program using same | |
Pahwa et al. | An efficient algorithm for data cleaning | |
CN116152018A (en) | High and new technology enterprise patent intellectual property project feasibility pre-evaluation system | |
CN114049966B (en) | Food-borne disease outbreak identification method and system based on link prediction | |
CN116028847A (en) | Universal method and system for automatic intelligent diagnosis of turbine mechanical faults | |
CN112559499A (en) | Data mining system and method | |
KR100686466B1 (en) | System and method for valuing loan portfolios using fuzzy clustering | |
Wang et al. | Decision tree classification algorithm for non-equilibrium data set based on random forests | |
CN114596152A (en) | Method, device and storage medium for predicting debt subject default based on unsupervised model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||