CN114443635A - Data cleaning method and device in soil big data analysis - Google Patents

Data cleaning method and device in soil big data analysis

Info

Publication number
CN114443635A
CN114443635A
Authority
CN
China
Prior art keywords
data
dispersed
soil
scattered
sphere
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210067946.4A
Other languages
Chinese (zh)
Other versions
CN114443635B (en)
Inventor
石媛媛
邓明军
唐健
赵隽宇
覃祚玉
宋贤冲
王会利
潘波
覃其云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Zhuang Autonomous Region Forestry Research Institute
Original Assignee
Guangxi Zhuang Autonomous Region Forestry Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Zhuang Autonomous Region Forestry Research Institute
Priority to CN202210067946.4A
Publication of CN114443635A
Application granted
Publication of CN114443635B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/24 Earth materials
    • G01N33/246 Earth materials for water content

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Food Science & Technology (AREA)
  • General Life Sciences & Earth Sciences (AREA)
  • Geology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Environmental & Geological Engineering (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Processing Of Solid Wastes (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of data analysis, and in particular to a data cleaning method and device in soil big data analysis. The method comprises the following steps: collecting soil data and acquiring the environmental data at the time of collection; performing data dispersion on the collected soil data by category to obtain a plurality of dispersed data sets; constructing a dispersed data sphere based on the data structure and data size of each dispersed data set; and finally constructing a data cleaning cube and integrating it with the dispersed data spheres to obtain the final cleaned data. Unlike the prior art, which cleans data only by searching for abnormal values, the method and device mark normal data by constructing a data cube, further correct the abnormal data, and build a correction model that accounts for anomalies in the soil data caused by environmental conditions, thereby significantly improving the accuracy of data cleaning.

Description

Data cleaning method and device in soil big data analysis
Technical Field
The invention belongs to the field of data analysis, and particularly relates to a data cleaning method and device in soil big data analysis.
Background
Data cleansing refers to the process of re-examining and verifying data, with the purpose of deleting duplicate information, correcting existing errors, and ensuring data consistency.
As the name suggests, data cleansing "washes out" the dirt: it is the last procedure for finding and correcting recognizable errors in a data file, including checking data consistency and handling invalid and missing values. Because the data in a data warehouse is a subject-oriented collection extracted from multiple business systems and containing historical data, it is unavoidable that some of the data is erroneous and some records conflict with one another; such erroneous or conflicting data is obviously unwanted and is called "dirty data". Dirty data must be washed off according to certain rules, and this washing is data cleansing. The task of data cleansing is to filter out the data that does not meet requirements, hand the filtered result to the competent business department, and let that department confirm whether it should be discarded or corrected and re-extracted. Unsatisfactory data falls into three major categories: incomplete data, erroneous data, and duplicate data. Data cleansing differs from questionnaire review in that cleansing after entry is generally performed by computer rather than by hand.
Patent document CN201510947469.0 discloses an attitude and orbit control data analysis method based on decision trees. It includes preprocessing of attitude and orbit control data, through which telemetry data deduplication, sorting, extraction, and outlier elimination are completed; hierarchical modeling of the attitude and orbit control system, establishing its information and control flow chart and determining the telemetry variables related to the current fault of the system as input variables for decision tree analysis; establishing a flow chart for the decision tree analysis; and a decision tree model, created with the C5.0 algorithm, in which the model name, the number of Boosting tests, the pruning attributes, and the minimum number of records per sub-branch are defined.
That patent describes a related technical scheme for data cleaning, but the cleaning still relies on existing conventional technology, and the cleaned data may still contain abnormal values, which reduces the accuracy of subsequent data analysis.
Disclosure of Invention
The invention mainly aims to provide a data cleaning method and device in soil big data analysis. Unlike the prior art, which cleans data only by searching for abnormal values, the method and device mark normal data by constructing a data cube, further correct the abnormal data, and build a correction model that accounts for anomalies caused by environmental data in the soil data, so that the accuracy of data cleaning is remarkably improved.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
a method of data cleansing in soil big data analysis, the method performing the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data set;
step 4: performing data analysis on each dispersed data set to obtain the data characteristics of all dispersed data, and constructing a data cleaning cube for each dispersed data set, taking the data characteristic of the dispersed data as the center and the data radius of the dispersed data as the side length;
step 5: placing the dispersed data sphere inside the data cleaning cube and turning it inside the cube; during the turning, the dispersed data on the surface of the sphere come into contact with the data cleaning cube, and every dispersed datum that makes contact is recorded;
step 6: retaining the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, merging the noise-reduced dispersed data with the recorded dispersed data to obtain merged dispersed data, and correcting the merged dispersed data with a preset correction model to obtain the finally cleaned soil data.
Further, the proportion set in step 2 ranges from 3 to 8, the value depending on the category of the classification data: when the classification data is the soil available water content, the set proportion is 3; for the sand content, 4; for the silt content, 5; for the clay content, 6; for the soil volume weight, 7; and for the organic carbon content, 8.
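As an illustration of step 2, the sketch below classifies the collected soil records by category and amplifies each category by its set proportion. The patent does not specify how the amplification is applied; the sketch assumes simple element-wise multiplication, and all identifiers (category keys, function names) are illustrative rather than taken from the patent.

```python
# A sketch of step 2 (data dispersion), assuming the "set proportion" acts as
# a multiplicative amplification factor per category. Category keys and
# function names are illustrative, not taken from the patent.

SET_PROPORTIONS = {
    "available_water_content": 3,  # soil available water content
    "sand_content": 4,
    "silt_content": 5,
    "clay_content": 6,
    "soil_volume_weight": 7,
    "organic_carbon_content": 8,
}

def disperse(soil_records: list[dict]) -> dict[str, list[float]]:
    """Classify collected soil data by category, then amplify each category
    by its set proportion to obtain the dispersed data sets."""
    dispersed = {category: [] for category in SET_PROPORTIONS}
    for record in soil_records:
        for category, value in record.items():
            if category in dispersed:
                dispersed[category].append(value * SET_PROPORTIONS[category])
    return dispersed
```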
Further, the method for constructing the dispersed data sphere in step 3 specifically includes: calculating the data volume of the dispersed data, taking the calculated data volume as the radius of the dispersed data sphere, and constructing the sphere with a preset data sphere construction model, so that the dispersed data are uniformly distributed on the outer surface of the sphere.
Further, the data sphere construction model is expressed by the following formula:
[Formula image BDA0003480901000000031 not reproduced in this text record; according to the following description, it computes the sphere-center coordinates O_x and O_y from min(C) and max(C).]
wherein C is the dispersed data; min(C) is the minimum value of the dispersed data; max(C) is the maximum value of the dispersed data; O_x is the calculated x-axis coordinate of the sphere center; O_y is the calculated y-axis coordinate of the sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. With the sphere center calculated by the construction model and the calculated data volume of the dispersed data as the radius, the dispersed data sphere is constructed.
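The construction can be sketched as follows. Since the patent's formula image is not reproduced, the center coordinates below are placeholder assumptions built from min(C) and max(C), the quantities the model is said to depend on; the Fibonacci lattice is one common way to realize the uniform surface distribution, not necessarily the patented one.

```python
import math

def build_data_sphere(C: list[float]) -> tuple[tuple[float, float, float], float]:
    """Sketch of the data-sphere construction. The patented formula is not
    reproduced in the text record, so the center coordinates below are
    placeholder assumptions using min(C) and max(C); the z-coordinate is 0
    and the radius is the data volume, as the description states."""
    o_x = (min(C) + max(C)) / 2.0  # assumed form, not the patented formula
    o_y = (max(C) - min(C)) / 2.0  # assumed form, not the patented formula
    radius = float(len(C))         # data volume of the dispersed data
    return (o_x, o_y, 0.0), radius

def surface_points(n: int, center: tuple[float, float, float], radius: float):
    """Place n dispersed data points roughly uniformly on the sphere surface
    with a Fibonacci lattice, one common way to realize the 'uniformly
    distributed on the outer surface' requirement."""
    cx, cy, cz = center
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n          # z in (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))   # radius of the z-slice
        theta = golden_angle * i
        points.append((cx + radius * r * math.cos(theta),
                       cy + radius * r * math.sin(theta),
                       cz + radius * z))
    return points
```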
Further, the method for performing data analysis on each dispersed data set in step 4 to obtain the data characteristics of all dispersed data includes: normalizing and averaging the dispersed data to generate first feature dispersed data, where the normalization uses n groups of normalization layers and the averaging uses n-1 averaging layers; performing a first group of dilation normalization on the first feature dispersed data to generate second feature dispersed data, and averaging the first feature dispersed data to generate third feature dispersed data; concatenating the third feature dispersed data with the second feature dispersed data to generate fourth feature dispersed data; and taking the fourth feature dispersed data as the data characteristic of the dispersed data.
Further, the third feature dispersed data has the same length as the second feature dispersed data.
Further, in the step of performing a first group of dilation normalization on the first feature dispersed data to generate the second feature dispersed data, the number of dilation normalization passes in the first group is three.
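A minimal sketch of this feature pipeline follows. The patent does not define "normalization", "averaging", or "dilation normalization" concretely, so the sketch interprets them as min-max normalization and moving averages; every operator choice is an assumption, not the patented definition.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Min-max normalization (an assumed reading of 'normalization')."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span else np.zeros_like(x)

def average(x: np.ndarray, window: int = 2) -> np.ndarray:
    """An averaging layer, realized here as a simple moving average."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

def data_features(dispersed: np.ndarray, n: int = 4) -> np.ndarray:
    """n normalization groups interleaved with n-1 averaging layers give the
    first feature data; three 'dilation normalization' passes (interpreted
    here as normalization over a widened window) give the second; one more
    averaging of the first gives the third; concatenating third and second
    yields the fourth feature data, used as the data characteristic."""
    first = dispersed.astype(float)
    for i in range(n):
        first = normalize(first)
        if i < n - 1:
            first = average(first)
    second = first.copy()
    for _ in range(3):                    # the first group has three passes
        second = normalize(average(second, window=3))
    third = average(first)
    third = third[: len(second)]          # enforce equal lengths (claim 6)
    second = second[: len(third)]
    return np.concatenate([third, second])  # the fourth feature data
```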
Further, the method in step 6 for correcting the merged dispersed data with a preset correction model to obtain the finally cleaned soil data includes: calculating a correction value with a preset correction-value model based on the obtained ambient temperature, ambient humidity, and ambient illumination intensity, and multiplying each datum in the merged dispersed data by the correction value to obtain the finally cleaned soil data.
Further, the correction value model is expressed using the following formula:
[Correction-value formula images BDA0003480901000000041 and BDA0003480901000000042 not reproduced in this text record; according to the following description, the model is a function of the ambient temperature, ambient humidity, and ambient illumination intensity.]
where λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
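Because the correction-value formula images are not reproduced here, the sketch below uses a purely illustrative placeholder for the model; only its inputs (ambient temperature λ, humidity θ, illumination ψ) and the multiplicative use of its output in step 6 come from the text. The reference conditions and coefficients are invented for illustration.

```python
def correction_value(temperature: float, humidity: float, light: float) -> float:
    """Placeholder for the patented correction-value model. Only its inputs
    (lambda: temperature, theta: humidity, psi: illumination) are known from
    the text; the reference conditions and coefficients below are invented
    purely for illustration."""
    ref_temperature, ref_humidity, ref_light = 20.0, 0.5, 10000.0  # assumed
    return ((1.0 + 0.01 * (temperature - ref_temperature))
            * (1.0 + 0.05 * (humidity - ref_humidity))
            * (1.0 + 0.001 * (light / ref_light - 1.0)))

def apply_correction(merged_data: list[float], k: float) -> list[float]:
    """Step 6: multiply every datum of the merged dispersed data by the
    correction value to obtain the finally cleaned soil data."""
    return [value * k for value in merged_data]
```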
A data cleaning device in soil big data analysis, configured to carry out the above method.
According to the data cleaning method and device in soil big data analysis, unlike the prior art, which cleans data only by searching for abnormal values, normal data are marked by constructing a data cube, abnormal data are then corrected, and a correction model is built that accounts for anomalies in the soil data caused by environmental conditions, so that the accuracy of data cleaning is remarkably improved. This is mainly achieved as follows. Amplification of data: the method first separates normal data from abnormal data through data amplification; after amplification the abnormal data differ obviously from the normal data, and this separation improves the efficiency of the subsequent data cleaning. Construction of a data cube: the constructed data cube allows a more three-dimensional analysis of the data and can directly lock onto abnormal data; between the two layers (cube and sphere) a corresponding contact fit is attempted, and data that achieve the fit are normal while data that do not are abnormal. This way of finding abnormal data is less efficient than the prior art but more accurate. Data correction: the soil data are corrected with the environmental conditions (temperature and humidity) recorded during acquisition, which improves the validity and accuracy of the cleaned data obtained after correction.
Drawings
Fig. 1 is a schematic structural diagram of the system of the data cleaning method in soil big data analysis according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating the data dispersion principle of the data cleaning method and device in soil big data analysis according to an embodiment of the present invention;
Fig. 3 is a schematic diagram illustrating the principle of turning the dispersed data sphere inside the data cleaning cube in the data cleaning method and device in soil big data analysis according to an embodiment of the present invention.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings and embodiments of the invention.
Example 1
As shown in Fig. 1, a data cleaning method in soil big data analysis performs the following steps:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data set;
step 4: performing data analysis on each dispersed data set to obtain the data characteristics of all dispersed data, and constructing a data cleaning cube for each dispersed data set, taking the data characteristic of the dispersed data as the center and the data radius of the dispersed data as the side length;
step 5: placing the dispersed data sphere inside the data cleaning cube and turning it inside the cube; during the turning, the dispersed data on the surface of the sphere come into contact with the data cleaning cube, and every dispersed datum that makes contact is recorded;
step 6: retaining the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, merging the noise-reduced dispersed data with the recorded dispersed data to obtain merged dispersed data, and correcting the merged dispersed data with a preset correction model to obtain the finally cleaned soil data.
Referring to Fig. 2, which illustrates the data dispersion principle of the invention: after data dispersion, the original data are amplified. The amplification does not change the underlying patterns of the data, but it makes the difference between abnormal and normal data obvious, so the amplified abnormal data can be found and the accuracy of data cleaning is improved.
Referring to Fig. 3: after the dispersed data sphere enters the data cleaning cube, the dispersed data distributed on the sphere's surface come into contact with the cube. Data that make contact with the cube are matched with corresponding cleaning data, i.e., they are normal data, while data that never make contact are abnormal data. Compared with directly detecting abnormal values as in the prior art, this approach can find hidden abnormal data and improves the accuracy of data cleaning.
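A geometric sketch of this contact test is given below, assuming the sphere and cube share a common center at the origin; the rotation axis, sampling density, and contact tolerance are assumptions, since the patent does not specify how the turning is performed.

```python
import math

def mark_contacts(points: list[tuple[float, float, float]],
                  half_side: float, eps: float = 1e-3,
                  steps: int = 36) -> set[int]:
    """Rotate ('turn') the sphere's surface points inside the cleaning cube
    and record every point that comes within eps of a cube face. The sphere
    and cube are assumed to share a center at the origin; the z-axis
    rotation sampling and the tolerance eps are assumptions."""
    recorded = set()
    for s in range(steps):
        angle = 2.0 * math.pi * s / steps
        ca, sa = math.cos(angle), math.sin(angle)
        for idx, (x, y, z) in enumerate(points):
            rx, ry = ca * x - sa * y, sa * x + ca * y  # rotated coordinates
            if max(abs(rx), abs(ry), abs(z)) >= half_side - eps:
                recorded.add(idx)  # this dispersed datum touched the cube
    return recorded
```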
Example 2
On the basis of the previous embodiment, the proportion set in step 2 ranges from 3 to 8, the value depending on the category of the classification data: when the classification data is the soil available water content, the set proportion is 3; for the sand content, 4; for the silt content, 5; for the clay content, 6; for the soil volume weight, 7; and for the organic carbon content, 8.
Example 3
On the basis of the above embodiment, the method for constructing the dispersed data sphere in step 3 specifically includes: calculating the data volume of the dispersed data, taking the calculated data volume as the radius of the dispersed data sphere, and constructing the sphere with a preset data sphere construction model, so that the dispersed data are uniformly distributed on the outer surface of the sphere.
Specifically, a consistency check examines whether the data are reasonable according to the valid value range of each variable and the relationships among variables, and finds data that are out of range, logically unreasonable, or contradictory. For example, a variable measured on a 1-7 scale that shows a value of 0, or a negative weight, should be regarded as outside the normal range. Software such as SPSS, SAS, and Excel can automatically identify out-of-range variable values according to the defined valid ranges. Logically inconsistent answers may appear in many forms: for example, many respondents say they drive to work yet report owning no car; or a respondent reports being a heavy buyer and user of a certain brand but gives a very low score on the familiarity scale. When an inconsistency is found, the questionnaire number, record number, variable name, and error category are listed to facilitate further checking and correction.
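A minimal range-check sketch in the spirit of this consistency check (field names and valid ranges are illustrative):

```python
def consistency_check(records: list[dict], valid_ranges: dict) -> list[tuple]:
    """Flag every (record index, variable, value) whose value falls outside
    the variable's defined valid range."""
    problems = []
    for i, record in enumerate(records):
        for variable, (low, high) in valid_ranges.items():
            value = record.get(variable)
            if value is not None and not (low <= value <= high):
                problems.append((i, variable, value))
    return problems

# e.g. a 1-7 scale answer recorded as 0, and a negative soil volume weight:
issues = consistency_check(
    [{"scale_answer": 0, "soil_volume_weight": -1.2}],
    {"scale_answer": (1, 7), "soil_volume_weight": (0.5, 2.0)},
)
print(issues)  # [(0, 'scale_answer', 0), (0, 'soil_volume_weight', -1.2)]
```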
Example 4
On the basis of the above embodiment, the data sphere construction model is expressed by using the following formula:
[Formula image BDA0003480901000000071 not reproduced in this text record; according to the following description, it computes the sphere-center coordinates O_x and O_y from min(C) and max(C).]
wherein C is the dispersed data; min(C) is the minimum value of the dispersed data; max(C) is the maximum value of the dispersed data; O_x is the calculated x-axis coordinate of the sphere center; O_y is the calculated y-axis coordinate of the sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. With the sphere center calculated by the construction model and the calculated data volume of the dispersed data as the radius, the dispersed data sphere is constructed.
Specifically, because of survey, coding, and entry errors, the data may contain invalid and missing values that require appropriate handling. The common treatments are estimation, casewise deletion, variable deletion, and pairwise deletion.
Estimation: the simplest approach is to replace invalid and missing values with the sample mean, median, or mode of the variable. This is simple but does not fully use the information already in the data, and the error may be large. Another approach is to estimate the value from the respondent's answers to other questions, through correlation analysis or logical inference among variables. For example, ownership of a product may be related to household income, so the likelihood of ownership can be inferred from the respondent's household income.
Casewise deletion eliminates samples containing missing values. Since many questionnaires may contain missing values, this approach can greatly reduce the effective sample size and waste data already collected, so it is suitable only when a key variable is missing or when samples with invalid or missing values make up a small proportion.
Variable deletion: if a variable has many invalid and missing values and is not particularly important to the problem under study, it may be deleted. This reduces the number of variables available for analysis but does not change the sample size.
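The first two treatments can be sketched as follows (a minimal illustration; missing values are represented as None):

```python
from statistics import mean, median

def impute(values: list, strategy: str = "mean") -> list:
    """Estimation: replace None (invalid or missing) entries with the sample
    mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def casewise_delete(records: list[dict], key_variables: list[str]) -> list[dict]:
    """Whole-case deletion: drop every record that is missing any of the key
    variables."""
    return [r for r in records
            if all(r.get(k) is not None for k in key_variables)]
```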
Example 5
On the basis of the above embodiment, the method for performing data analysis on each dispersed data set in step 4 to obtain the data characteristics of all dispersed data includes: normalizing and averaging the dispersed data to generate first feature dispersed data, where the normalization uses n groups of normalization layers and the averaging uses n-1 averaging layers; performing a first group of dilation normalization on the first feature dispersed data to generate second feature dispersed data, and averaging the first feature dispersed data to generate third feature dispersed data; concatenating the third feature dispersed data with the second feature dispersed data to generate fourth feature dispersed data; and taking the fourth feature dispersed data as the data characteristic of the dispersed data.
Example 6
On the basis of the above embodiment, the third feature dispersed data and the second feature dispersed data have the same length.
Abnormal data generally falls into two categories:
1) Avoidable dirty data
As the name implies, avoidable dirty data can be made valid directly by simple processing or avoided through manual correction.
Such dirty data is quite common in everyday life, e.g., errors due to naming irregularities, spelling errors, entry errors, nulls, etc.
2) Unavoidable dirty data
The main forms of unavoidable dirty data include outliers, duplicates, and nulls; such dirty data must be detected and handled statistically.
The 3σ law is a common means of detecting abnormal values: assuming that a group of measurements contains only random errors, the standard deviation is calculated and an interval is determined at a given probability; an error beyond that interval is regarded not as a random error but as a gross error, and the data containing it are removed. The interval is usually plus or minus three standard deviations around the mean, hence the name 3σ law.
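A direct implementation of this 3σ test:

```python
from statistics import mean, stdev

def three_sigma_filter(data: list[float]) -> tuple[list[float], list[float]]:
    """Keep values within three standard deviations of the mean; values
    beyond that interval are treated as gross errors and removed."""
    m, s = mean(data), stdev(data)
    kept = [x for x in data if abs(x - m) <= 3 * s]
    removed = [x for x in data if abs(x - m) > 3 * s]
    return kept, removed
```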
Example 7
On the basis of the above embodiment, in the step of performing a first group of dilation normalization on the first feature dispersed data to generate the second feature dispersed data, the number of dilation normalization passes in the first group is three.
Example 8
On the basis of the previous embodiment, the method in step 6 for correcting the merged dispersed data with a preset correction model to obtain the finally cleaned soil data includes: calculating a correction value with a preset correction-value model based on the obtained ambient temperature, ambient humidity, and ambient illumination intensity, and multiplying each datum in the merged dispersed data by the correction value to obtain the finally cleaned soil data.
Specifically, data cleansing, the process of rechecking and validating data, is intended to remove duplicate information, correct existing errors, and ensure data consistency.
Example 9
On the basis of the above embodiment, the correction value model is expressed using the following formula:
[Correction-value formula image BDA0003480901000000091 not reproduced in this text record; according to the following description, the model is a function of the ambient temperature, ambient humidity, and ambient illumination intensity.]
where λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
Specifically, because soil data are easily influenced by environmental conditions during collection, the soil data need to be corrected with the environmental data; the corrected data markedly improve accuracy.
Example 10
On the basis of the above embodiments, a data cleaning device in soil big data analysis carries out the method of any of the above embodiments.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of the present disclosure.
The present examples are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein. For example, various elements or components may be combined or integrated in another system, or certain features may be omitted or not implemented.
Furthermore, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may also be indirectly coupled or communicating through some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.

Claims (10)

1. A method for data cleaning in soil big data analysis, characterized in that the method comprises the following steps:
step 1: collecting soil data, and acquiring environmental data when the soil data is collected; the collected soil data at least comprises: soil available water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly, classifying collected soil data according to data types to obtain a plurality of classification data, and then amplifying each classification data according to a set proportion to obtain dispersed data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data set;
step 4: performing data analysis on each dispersed data set to obtain the data characteristics of all dispersed data, and constructing a data cleaning cube for each dispersed data set, taking the data characteristic of the dispersed data as the center and the data radius of the dispersed data as the side length;
step 5: placing the dispersed data sphere inside the data cleaning cube and turning it inside the cube; during the turning, the dispersed data on the surface of the sphere come into contact with the data cleaning cube, and every dispersed datum that makes contact is recorded;
step 6: retaining the recorded dispersed data, performing data noise reduction on the unrecorded dispersed data, merging the noise-reduced dispersed data with the recorded dispersed data to obtain merged dispersed data, and correcting the merged dispersed data with a preset correction model to obtain the finally cleaned soil data.
2. The method of claim 1, wherein the proportion set in step 2 ranges from 3 to 8, the value depending on the category of the classification data: when the classification data is the soil available water content, the set proportion is 3; for the sand content, 4; for the silt content, 5; for the clay content, 6; for the soil volume weight, 7; and for the organic carbon content, 8.
3. The method of claim 2, wherein the step 3 of constructing the dispersed data sphere specifically comprises: calculating the data volume of the dispersed data, taking the calculated data volume as the radius of the dispersed data sphere, and constructing the sphere with a preset data sphere construction model, so that the dispersed data are uniformly distributed on the outer surface of the sphere.
4. The method of claim 3, wherein the data sphere construction model is represented using the formula:
[Formula image RE-FDA0003580025980000021 not reproduced in this text record; according to the following description, it computes the sphere-center coordinates O_x and O_y from min(C) and max(C).]
wherein C is the dispersed data; min(C) is the minimum value of the dispersed data; max(C) is the maximum value of the dispersed data; O_x is the calculated x-axis coordinate of the sphere center; O_y is the calculated y-axis coordinate of the sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. With the sphere center calculated by the construction model and the calculated data volume of the dispersed data as the radius, the dispersed data sphere is constructed.
5. The method of claim 4, wherein the step 4 of performing data analysis on each dispersed data set to obtain the data characteristics of all dispersed data comprises: normalizing and averaging the dispersed data to generate first feature dispersed data, where the normalization uses n groups of normalization layers and the averaging uses n-1 averaging layers; performing a first group of dilation normalization on the first feature dispersed data to generate second feature dispersed data, and averaging the first feature dispersed data to generate third feature dispersed data; concatenating the third feature dispersed data with the second feature dispersed data to generate fourth feature dispersed data; and taking the fourth feature dispersed data as the data characteristic of the dispersed data.
6. The method of claim 5, wherein the third feature dispersed data has the same length as the second feature dispersed data.
7. The method according to claim 6, wherein in the step of performing the first group of dilation normalization on the first feature dispersed data to generate the second feature dispersed data, the number of dilation normalization passes in the first group is three.
8. The method of claim 7, wherein the step 6 of correcting the merged dispersed data with a preset correction model to obtain the finally cleaned soil data comprises: calculating a correction value with a preset correction-value model based on the obtained ambient temperature, ambient humidity, and ambient illumination intensity, and multiplying each datum in the merged dispersed data by the correction value to obtain the finally cleaned soil data.
9. The method of claim 8, wherein the correction value model is represented using the following equation:
[Correction-value formula image FDA0003480900990000031 not reproduced in this text record; according to the following description, the model is a function of the ambient temperature, ambient humidity, and ambient illumination intensity.]
where λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
10. A data cleaning device for use in soil big data analysis, configured to carry out the method of any one of claims 1 to 9.
CN202210067946.4A 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis Active CN114443635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210067946.4A CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210067946.4A CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Publications (2)

Publication Number Publication Date
CN114443635A 2022-05-06
CN114443635B CN114443635B (en) 2024-04-09

Family

ID=81368576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210067946.4A Active CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Country Status (1)

Country Link
CN (1) CN114443635B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004053659A2 (en) * 2002-12-10 2004-06-24 Stone Investments, Inc Method and system for analyzing data and creating predictive models
CN107741990A (en) * 2017-11-01 2018-02-27 深圳汇生通科技股份有限公司 Data cleansing integration method and system
CN109739850A (en) * 2019-01-11 2019-05-10 安徽爱吉泰克科技有限公司 A kind of archives big data intellectual analysis cleaning digging system
WO2021139249A1 (en) * 2020-05-28 2021-07-15 平安科技(深圳)有限公司 Data anomaly detection method, apparatus and device, and storage medium
CN112733417A (en) * 2020-11-16 2021-04-30 南京邮电大学 Abnormal load data detection and correction method and system based on model optimization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
梁卫宁; 周钰书; 唐文彬; 刘森; 陈玲娜: "Research and Application of Marketing Data Cleaning and Governance Methods" (营销数据清洗及治理方法的研究及应用), Information Technology and Informatization (信息技术与信息化), no. 07, 28 July 2020 *
毛云鹏; 龙虎; 邓韧; 郭欣: "Application of Data Cleaning in Medical Big Data Analysis" (数据清洗在医疗大数据分析中的应用), China Digital Medicine (中国数字医学), no. 06, 15 June 2017 *

Also Published As

Publication number Publication date
CN114443635B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
Lee et al. Intelliclean: a knowledge-based intelligent data cleaner
Wen et al. MVS-GCN: A prior brain structure learning-guided multi-view graph convolution network for autism spectrum disorder diagnosis
CN107633265B (en) Data processing method and device for optimizing credit evaluation model
Low et al. A knowledge-based approach for duplicate elimination in data cleaning
Bock et al. Analysis of symbolic data: exploratory methods for extracting statistical information from complex data
CN107741990B (en) Data cleaning integration method and system
CN114864099B (en) Clinical data automatic generation method and system based on causal relationship mining
CN112597238A (en) Method, system, device and medium for establishing knowledge graph based on personnel information
Al-Rasheed Identification of important features and data mining classification techniques in predicting employee absenteeism at work.
Ebden et al. Network analysis on provenance graphs from a crowdsourcing application
CN114861719A (en) High-speed train bearing fault diagnosis method based on ensemble learning
CN114358611A (en) Subject development-based data acquisition system for scientific research capability assessment
CN114443635B (en) Data cleaning method and device in soil big data analysis
CN111915368B (en) System, method and medium for identifying customer ID in automobile industry
Khoshgoftaar et al. Detecting noisy instances with the rule-based classification model
CN112506907A (en) Engineering machinery marketing strategy pushing method, system and device based on big data
KR101985961B1 (en) Similarity Quantification System of National Research and Development Program and Searching Cooperative Program using same
Pahwa et al. An efficient algorithm for data cleaning
CN116152018A (en) High and new technology enterprise patent intellectual property project feasibility pre-evaluation system
CN114049966B (en) Food-borne disease outbreak identification method and system based on link prediction
CN116028847A (en) Universal method and system for automatic intelligent diagnosis of turbine mechanical faults
CN112559499A (en) Data mining system and method
KR100686466B1 (en) System and method for valuing loan portfolios using fuzzy clustering
Wang et al. Decision tree classification algorithm for non-equilibrium data set based on random forests
CN114596152A (en) Method, device and storage medium for predicting debt subject default based on unsupervised model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant