CN114443635B - Data cleaning method and device in soil big data analysis - Google Patents


Info

Publication number
CN114443635B
CN114443635B (application CN202210067946.4A)
Authority
CN
China
Prior art keywords
data
scattered
soil
sphere
dispersion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210067946.4A
Other languages
Chinese (zh)
Other versions
CN114443635A (en)
Inventor
石媛媛
邓明军
唐健
赵隽宇
覃祚玉
宋贤冲
王会利
潘波
覃其云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Zhuang Autonomous Region Forestry Research Institute
Original Assignee
Guangxi Zhuang Autonomous Region Forestry Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Zhuang Autonomous Region Forestry Research Institute filed Critical Guangxi Zhuang Autonomous Region Forestry Research Institute
Priority to CN202210067946.4A priority Critical patent/CN114443635B/en
Publication of CN114443635A publication Critical patent/CN114443635A/en
Application granted granted Critical
Publication of CN114443635B publication Critical patent/CN114443635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00: Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/24: Earth materials
    • G01N33/246: Earth materials for water content

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Food Science & Technology (AREA)
  • General Life Sciences & Earth Sciences (AREA)
  • Geology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Environmental & Geological Engineering (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Processing Of Solid Wastes (AREA)

Abstract

The invention relates to the field of data analysis, and in particular to a data cleaning method and device in soil big data analysis. The method comprises the following steps: collecting soil data, and acquiring environmental data at the time of collection; dispersing the collected soil data by category to obtain a plurality of dispersed data sets; constructing a dispersed data sphere from the data structure and data size of each dispersed data set; and finally constructing a data cleaning cube and combining it with the dispersed data spheres to obtain the final cleaned data. Unlike the prior art, which only searches the data itself for abnormal values, the method marks normal data by constructing a data cube so that abnormal data can be corrected, and builds a correction model around anomalies caused by environmental conditions during soil data collection, thereby remarkably improving the accuracy of data cleaning.

Description

Data cleaning method and device in soil big data analysis
Technical Field
The invention belongs to the field of data analysis, and particularly relates to a data cleaning method and device in soil big data analysis.
Background
Data cleansing refers to the process of re-examining and verifying data, with the aim of deleting duplicate information, correcting existing errors, and ensuring data consistency.
Data cleansing is, as the name suggests, the "washing out" of "dirty" data: the final procedure for finding and correcting identifiable errors in a data file, including checking data consistency and handling invalid and missing values. Because the data in a data warehouse is a subject-oriented collection extracted from multiple business systems and containing historical data, it is unavoidable that some of the data is erroneous or that some records conflict with one another. Such erroneous or conflicting records are obviously unwanted and are called "dirty data". Dirty data must be "washed out" according to certain rules, and this is data cleansing. The task of data cleansing is to filter out data that does not meet requirements and hand the filtered results to the responsible business department, which confirms whether the data should be discarded or corrected before extraction. Data that does not meet requirements mainly comprises incomplete data, erroneous data, and duplicate data. Data cleansing differs from questionnaire auditing in that cleansing after entry is generally done by computer rather than by hand.
Patent document CN201510947469.0 discloses a decision-tree-based method for analyzing attitude and orbit control data. The method preprocesses the attitude and orbit control data, completing telemetry data de-duplication, sorting, extraction, and outlier rejection; performs hierarchical modeling of the attitude and orbit control system, establishing its information and control flow chart, determining the telemetry variables related to the current fault, and taking them as input variables for decision tree analysis; establishes a flow chart of the decision tree analysis; and creates a decision tree model using the C5.0 algorithm, defining the model name, the number of boosting tests, the pruning attributes, and the minimum record count of each sub-branch.
That patent describes a related scheme for cleaning data, but its cleaning approach still relies on conventional prior art: abnormal values remain in the cleaned data, which reduces the accuracy of subsequent data analysis.
Disclosure of Invention
The main aim of the invention is to provide a data cleaning method and device in soil big data analysis. Unlike the prior art, which only searches the data itself for abnormal values, the method marks normal data by constructing a data cube so that abnormal data can be corrected, and builds a correction model around anomalies caused by environmental conditions during soil data collection, thereby remarkably improving the accuracy of data cleaning.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a method for data cleansing in soil big data analysis, the method comprising the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data are collected; the collected soil data at least comprises: soil effective water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to the categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly classifying collected soil data according to data types to obtain a plurality of classified data, and amplifying each classified data according to a set proportion to obtain scattered data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: carrying out data analysis on each piece of scattered data to obtain data characteristics of all pieces of scattered data, and respectively constructing a data cleaning cube of all pieces of scattered data by taking the data characteristics of each piece of scattered data as a center and taking the data radius of the scattered data as a side length;
step 5: placing the scattered data spheres in the data cleaning cube, turning the scattered data spheres in the data cleaning cube, wherein in the turning process, the scattered data on the surface of the scattered data spheres are contacted with the data cleaning cube, and each contacted scattered data is recorded;
step 6: and reserving recorded scattered data, carrying out data denoising on the scattered data which are not recorded, combining the scattered data after denoising with the recorded scattered data to obtain combined scattered data, and correcting the combined scattered data by using a preset correction model to obtain the final cleaned soil data.
Further, the ratio set in step 2 ranges from 3 to 8, the value depending on the type of the classified data: when the classified data is the soil effective water content, the set ratio is 3; when it is the sand content, the set ratio is 4; when it is the silt content, the set ratio is 5; when it is the clay content, the set ratio is 6; when it is the soil volume weight, the set ratio is 7; and when it is the organic carbon content, the set ratio is 8.
Further, the method for constructing the dispersion data sphere in the step 3 specifically includes: calculating the data size of the scattered data, taking the calculated data size of the scattered data as the radius of the scattered data sphere, and constructing a model by using a preset data sphere to construct a scattered data sphere so that the scattered data is uniformly distributed on the outer surface of the scattered data sphere.
Further, the data sphere construction model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): C is the dispersed data; min(C) is the minimum of the dispersed data; max(C) is the maximum of the dispersed data; O_x is the calculated x-axis coordinate of the sphere center; O_y is the calculated y-axis coordinate of the sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. With the sphere center calculated by the construction model and the calculated data size of the dispersed data as the radius, the dispersed data sphere is constructed.
Further, the method in step 4 for performing data analysis on each piece of dispersed data to obtain the data features of all the dispersed data comprises: normalizing and averaging the dispersed data to generate first feature dispersion data, where n groups of normalization layers are used for normalization and n-1 layers for averaging; performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, and performing averaging processing on the first feature dispersion data to generate third feature dispersion data; splicing the third feature dispersion data with the second feature dispersion data to generate fourth feature dispersion data; and taking the fourth feature dispersion data as the data feature of the dispersed data.
Further, the third feature dispersion data has the same length as the second feature dispersion data.
Further, in the step of performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, the first group comprises three expansion normalization passes.
Further, in the step 6, the method for correcting the combined dispersed data by using a preset correction model to obtain the final cleaned soil data includes: based on the obtained ambient temperature, ambient humidity and ambient illumination intensity, a preset correction value model is used for calculating to obtain a correction value, and the correction value is multiplied by each piece of data in the combined scattered data to obtain final cleaned soil data.
Further, the correction value model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
Data cleaning device in soil big data analysis.
Compared with the prior art, which only searches the data itself for abnormal values, the data cleaning method and device in soil big data analysis mark normal data by constructing a data cube so that abnormal data can be corrected, and build a correction model around anomalies caused by environmental conditions during soil data collection, thereby remarkably improving the accuracy of data cleaning. This is mainly realized as follows. Amplification of data: the invention first distinguishes normal data from abnormal data through data amplification; after amplification the abnormal data differ markedly from the normal data, and this distinction improves the efficiency of subsequent cleaning. Construction of a data cube: the constructed data cube supports a more three-dimensional analysis of the data and can directly lock onto abnormal data; the dispersed data are fitted against the cleaning cube, data that make contact with it are normal, and data that do not are abnormal. This way of finding abnormal data is more accurate, although less efficient than the prior art. Data correction: the invention corrects soil data using the environmental temperature and humidity at collection time, improving the validity and accuracy of the cleaned data obtained after correction.
Drawings
Fig. 1 is a schematic system structure diagram of a data cleaning method in soil big data analysis according to an embodiment of the present invention;
fig. 2 is a schematic diagram of data dispersion of a data cleaning method and a data cleaning device in soil big data analysis according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a principle of turning a dispersed data sphere inside a data cleaning cube in the data cleaning method and device in soil big data analysis according to the embodiment of the invention.
Detailed Description
The method of the invention is described in further detail below in connection with the accompanying drawings and the embodiments of the invention.
Example 1
As shown in fig. 1, a data cleaning method in soil big data analysis performs the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data are collected; the collected soil data at least comprises: soil effective water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to the categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly classifying collected soil data according to data types to obtain a plurality of classified data, and amplifying each classified data according to a set proportion to obtain scattered data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: carrying out data analysis on each piece of scattered data to obtain data characteristics of all pieces of scattered data, and respectively constructing a data cleaning cube of all pieces of scattered data by taking the data characteristics of each piece of scattered data as a center and taking the data radius of the scattered data as a side length;
step 5: placing the scattered data spheres in the data cleaning cube, turning the scattered data spheres in the data cleaning cube, wherein in the turning process, the scattered data on the surface of the scattered data spheres are contacted with the data cleaning cube, and each contacted scattered data is recorded;
step 6: and reserving recorded scattered data, carrying out data denoising on the scattered data which are not recorded, combining the scattered data after denoising with the recorded scattered data to obtain combined scattered data, and correcting the combined scattered data by using a preset correction model to obtain the final cleaned soil data.
Referring to fig. 2, fig. 2 is a schematic diagram of data dispersion according to the present invention. After data dispersion, the original data are amplified; the amplified data follow the same underlying pattern as the original data, but abnormal data become easier to find because their difference from normal data is magnified, which improves the accuracy of data cleaning.
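The dispersion step can be sketched in a few lines; the category names, the `disperse` helper, and the dictionary layout are illustrative assumptions, while the per-category ratios follow the values stated in Example 2:

```python
# Per-category amplification ratios as stated in the patent text
# (category key names are assumptions for illustration).
AMPLIFICATION_RATIOS = {
    "effective_water": 3, "sand": 4, "silt": 5,
    "clay": 6, "bulk_density": 7, "organic_carbon": 8,
}

def disperse(records):
    """Group raw (category, value) pairs and scale each group by its ratio."""
    dispersed = {}
    for category, value in records:
        dispersed.setdefault(category, []).append(
            value * AMPLIFICATION_RATIOS[category])
    return dispersed

out = disperse([("sand", 0.31), ("sand", 0.29), ("clay", 0.12)])
```

Multiplying each class by a different fixed ratio preserves the relative pattern within a class while widening the gap between normal values and outliers, which is the effect fig. 2 illustrates.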
Referring to fig. 3, after the dispersed data sphere enters the data cleaning cube, the dispersed data distributed on its surface come into contact with the cube. Data on the sphere that make contact with the cube are normal data that can be cleaned, while data that make no contact are abnormal data. Compared with the prior art, which detects abnormal values directly on the data, this approach can uncover hidden abnormal data and improves the accuracy of data cleaning.
Example 2
On the basis of the above embodiment, the ratio set in step 2 ranges from 3 to 8, the value depending on the type of the classified data: when the classified data is the soil effective water content, the set ratio is 3; when it is the sand content, the set ratio is 4; when it is the silt content, the set ratio is 5; when it is the clay content, the set ratio is 6; when it is the soil volume weight, the set ratio is 7; and when it is the organic carbon content, the set ratio is 8.
Example 3
On the basis of the above embodiment, the method for constructing the dispersion data sphere in the step 3 specifically includes: calculating the data size of the scattered data, taking the calculated data size of the scattered data as the radius of the scattered data sphere, and constructing a model by using a preset data sphere to construct a scattered data sphere so that the scattered data is uniformly distributed on the outer surface of the scattered data sphere.
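Since the patent's sphere-construction formula is not reproduced in the source text, the following sketch rests on plausible assumptions: the center is the midpoint of the data range on the x and y axes with z fixed at 0 (as the text describes), the element count stands in for the "data size" used as the radius, and a Fibonacci-sphere layout stands in for the required uniform distribution of the data over the sphere surface:

```python
import math

def build_data_sphere(values):
    """Return (center, radius, points) with one surface point per datum.

    Center and radius choices are assumptions; the formula itself is
    omitted from the source text.
    """
    c_min, c_max = min(values), max(values)
    center = ((c_min + c_max) / 2.0, (c_min + c_max) / 2.0, 0.0)
    radius = float(len(values))  # "data size" taken as the element count
    n = len(values)
    golden = math.pi * (3.0 - math.sqrt(5.0))  # Fibonacci-sphere spacing
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * i / max(n - 1, 1)          # evenly spaced heights
        r = math.sqrt(max(0.0, 1.0 - z * z))       # ring radius at height z
        theta = golden * i                          # golden-angle rotation
        points.append((center[0] + radius * r * math.cos(theta),
                       center[1] + radius * r * math.sin(theta),
                       center[2] + radius * z))
    return center, radius, points
```

Every generated point lies exactly on the sphere surface, matching the requirement that the dispersed data be distributed on the outer surface of the dispersed data sphere.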
Specifically, a consistency check examines whether the data are reasonable given the plausible value range and interrelationships of each variable, and finds data that are out of range, logically unreasonable, or contradictory. For example, a variable measured on a 1-7 scale that takes the value 0, or a negative body weight, should be regarded as out of range. Software such as SPSS, SAS, and Excel can automatically identify out-of-range values according to a defined range. Logically inconsistent answers may appear in several forms: many respondents say they drive to work yet report owning no car, or a respondent reports being a heavy purchaser and user of a brand while giving a very low score on the familiarity scale. When inconsistencies are found, the questionnaire serial number, record serial number, variable name, and error category are listed to facilitate further verification and correction.
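A minimal range check in the spirit of the consistency check described above; the variable names and plausible ranges are assumptions for illustration:

```python
# Assumed plausible ranges for two illustrative variables.
PLAUSIBLE_RANGES = {
    "scale_1_to_7": (1, 7),
    "weight_kg": (0.0, 700.0),
}

def consistency_check(variable, values):
    """Return the values that fall outside the variable's plausible range."""
    low, high = PLAUSIBLE_RANGES[variable]
    return [v for v in values if not (low <= v <= high)]
```

In practice the flagged values would be listed together with record identifiers for further verification, as the text describes.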
Example 4
On the basis of the above embodiment, the data sphere construction model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): C is the dispersed data; min(C) is the minimum of the dispersed data; max(C) is the maximum of the dispersed data; O_x is the calculated x-axis coordinate of the sphere center; O_y is the calculated y-axis coordinate of the sphere center; and the z-axis coordinate of the sphere center is uniformly set to 0. With the sphere center calculated by the construction model and the calculated data size of the dispersed data as the radius, the dispersed data sphere is constructed.
Specifically, owing to survey, coding, and entry errors, the data may contain invalid and missing values that need to be handled appropriately. The usual treatments are estimation, casewise deletion, variable deletion, and pairwise deletion.
Estimation. The simplest approach is to replace invalid and missing values with the sample mean, median, or mode of the variable. This is easy but does not make full use of the information already in the data, and the error may be large. Another approach is to estimate the answer from the respondent's answers to other questions, using correlation analysis or logical inference between variables. For example, ownership of a product may be related to household income, so the likelihood of ownership can be inferred from the respondent's household income.
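Mean substitution, the simplest estimation approach mentioned above, can be sketched as follows (missing entries are represented as `None`):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

As the text notes, this keeps the sample size intact but ignores between-variable information, so the imputation error may be large.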
Casewise deletion discards samples containing missing values. Since many questionnaires may have missing values, this can significantly reduce the effective sample size and fail to make full use of the data already collected. It is therefore only suitable when a critical variable is missing, or when samples containing invalid or missing values make up a small proportion.
Variable deletion. If a variable has many invalid and missing values and is not particularly important to the problem under study, it may be deleted. This reduces the number of variables available for analysis but does not change the sample size.
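Casewise deletion and variable deletion can be sketched together; the 50% missing-share threshold for dropping a variable is an illustrative assumption, not a value from the text:

```python
def casewise_delete(rows):
    """Drop every row (case) that contains a missing value."""
    return [r for r in rows if None not in r]

def variable_delete(rows, max_missing_share=0.5):
    """Drop columns (variables) whose missing share exceeds the threshold."""
    n = len(rows)
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing_share]
    return [[r[j] for j in keep] for r in rows]

data = [[1, None, 3], [4, None, 6], [7, 8, 9]]
```

Note the trade-off the text describes: casewise deletion shrinks the sample, while variable deletion keeps every sample but loses a variable.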
Example 5
On the basis of the above embodiment, the method in step 4 for performing data analysis on each piece of dispersed data to obtain the data features of all the dispersed data comprises: normalizing and averaging the dispersed data to generate first feature dispersion data, where n groups of normalization layers are used for normalization and n-1 layers for averaging; performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, and performing averaging processing on the first feature dispersion data to generate third feature dispersion data; splicing the third feature dispersion data with the second feature dispersion data to generate fourth feature dispersion data; and taking the fourth feature dispersion data as the data feature of the dispersed data.
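A loose sketch of the feature pipeline described above, assuming min-max normalization and a single simplified "expansion" pass; the actual layer counts and the expansion normalization operation are not specified precisely in the text, so both are stand-ins:

```python
def normalize(values):
    """Min-max normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def feature_vector(values):
    first = normalize(values)                        # first feature data
    second = normalize([v * 2 for v in first])       # stand-in "expansion" pass
    third = [sum(first) / len(first)] * len(second)  # averaged, same length
    return third + second                            # spliced fourth feature data
```

The splice keeps the third and second feature data the same length, matching the constraint in the next embodiment.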
Example 6
On the basis of the above embodiment, the third feature dispersion data has the same length as the second feature dispersion data.
Abnormal data is generally divided into two categories:
1) Avoidable dirty data
As the name suggests, avoidable dirty data can be processed directly into valid data or avoided through manual correction.
Such dirty data is quite common in practice: errors caused by naming irregularities, spelling mistakes, input errors, null values, and so on.
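Avoidable dirty data of this kind can often be fixed mechanically; the normalization rules below (trim, lowercase, underscores, drop empties) are illustrative assumptions:

```python
def clean_labels(labels):
    """Normalize irregular names and drop empty entries."""
    cleaned = []
    for raw in labels:
        label = raw.strip().lower().replace(" ", "_")
        if label:  # drop entries that were empty or whitespace-only
            cleaned.append(label)
    return cleaned
```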
2) Unavoidable dirty data
Unavoidable dirty data mainly takes the form of outliers, repeated values, null values, and the like, and must be handled during cleaning.
A common detection means for abnormal values is the 3σ rule: assuming a group of measurements contains only random error, the standard deviation is calculated and an interval determined at a given probability; errors beyond that interval are considered gross errors rather than random errors, and the data containing them should be removed. In general the interval is the mean plus or minus three standard deviations, hence the name 3σ rule.
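The 3σ rule described above can be implemented directly (here using the population standard deviation):

```python
import math

def three_sigma_filter(values):
    """Keep values within mean ± 3 standard deviations; drop gross errors."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) <= 3 * std]
```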
Example 7
On the basis of the above embodiment, in the step of performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, the first group comprises three expansion normalization passes.
Example 8
On the basis of the above embodiment, the method for correcting the combined dispersion data to obtain the final cleaned soil data in step 6 by using a preset correction model includes: based on the obtained ambient temperature, ambient humidity and ambient illumination intensity, a preset correction value model is used for calculating to obtain a correction value, and the correction value is multiplied by each piece of data in the combined scattered data to obtain final cleaned soil data.
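Since the correction value model itself is not reproduced in the source, this sketch treats it as a pluggable function of the three environmental readings and shows only the multiplicative application described above; the no-op default `model` is a placeholder assumption:

```python
def correction_value(temp, humidity, light, model=lambda t, h, l: 1.0):
    """Compute the correction value from environmental readings.

    `model` stands in for the patent's correction formula, which the
    source text does not reproduce; the default is a no-op placeholder.
    """
    return model(temp, humidity, light)

def apply_correction(values, correction):
    """Multiply every datum in the combined dispersed data by the correction."""
    return [v * correction for v in values]
```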
Specifically, data cleansing is the process of re-examining and verifying data, aimed at deleting duplicate information, correcting existing errors, and ensuring data consistency.
Example 9
On the basis of the above embodiment, the correction value model is expressed using the following formula (the formula appears only as an image in the source and is not reproduced here): λ is the ambient temperature, θ is the ambient humidity, and ψ is the ambient illumination intensity.
Specifically, because soil data are easily affected by environmental parameters during collection, they need to be corrected against the environmental data; the corrected data markedly improve accuracy.
Example 10
Data cleaning device in soil big data analysis.
Although several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of the present disclosure.
The present examples are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein. For example, various elements or components may be combined or combined in another system, or certain features may be omitted or not implemented.
Furthermore, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present invention. Other items shown or discussed as coupled or directly coupled or communicating with each other may also be indirectly coupled or communicating through some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art without departing from the spirit and scope disclosed herein.

Claims (8)

1. A method for data cleaning in soil big data analysis, characterized in that the method performs the steps of:
step 1: collecting soil data, and acquiring environmental data when the soil data are collected; the collected soil data at least comprises: soil effective water content, sand content, silt content, clay content, soil volume weight and organic carbon content; the environmental data includes: ambient temperature, ambient humidity, and ambient light intensity;
step 2: performing data dispersion on the collected soil data according to the categories to obtain a plurality of dispersed data sets; the data dispersion process comprises the following steps: firstly classifying collected soil data according to data types to obtain a plurality of classified data, and amplifying each classified data according to a set proportion to obtain scattered data;
step 3: constructing a dispersed data sphere based on the data structure and the data size of each dispersed data;
step 4: carrying out data analysis on each piece of scattered data to obtain data characteristics of all pieces of scattered data, and respectively constructing a data cleaning cube of all pieces of scattered data by taking the data characteristics of each piece of scattered data as a center and taking the data radius of the scattered data as a side length;
step 5: placing the scattered data spheres in the data cleaning cube, turning the scattered data spheres in the data cleaning cube, wherein in the turning process, the scattered data on the surface of the scattered data spheres are contacted with the data cleaning cube, and each contacted scattered data is recorded;
step 6: reserving recorded scattered data, carrying out data denoising on the scattered data which are not recorded, combining the scattered data after denoising with the recorded scattered data to obtain combined scattered data, and correcting the combined scattered data by using a preset correction model to obtain final cleaned soil data;
the method for constructing the dispersed data sphere in the step 3 specifically comprises the following steps: calculating the data size of the scattered data, taking the calculated data size of the scattered data as the radius of the scattered data sphere, and constructing a model by using a preset data sphere to construct a scattered data sphere so that the scattered data is uniformly distributed on the outer surface of the scattered data sphere;
the number isThe sphere-based build model is expressed using the following formula:the method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>For dispersing data +.>Is the minimum value of the scattered data; />Is the maximum value of the scattered data; />For calculating the centre of sphereAn axis coordinate; />For the calculated centre of sphere +.>An axis coordinate; ball center +.>The unified value of the axis coordinates is 0; and (3) constructing a dispersed data sphere by taking the calculated data size of the dispersed data as the radius of the dispersed data sphere through the sphere center calculated by the sphere construction model.
2. The method according to claim 1, wherein the proportion set in step 2 takes values in the range 3-8, the value depending on the type of the classified data: when the classified data is the effective water content of the soil, the set proportion is 3; when the classified data is the sand content, the set proportion is 4; when the classified data is the sludge content, the set proportion is 5; when the classified data is the clay content, the set proportion is 6; when the classified data is the soil volume weight, the set proportion is 7; and when the classified data is the organic carbon content, the set proportion is 8.
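Claim 2's type-to-proportion mapping is a plain lookup table; a minimal rendering (the English key names are this sketch's own):

```python
# Set-proportion lookup from claim 2: the value depends on the type of
# classified soil data (key names are English renderings of the claim terms).
SET_PROPORTION = {
    "effective_water_content": 3,
    "sand_content": 4,
    "sludge_content": 5,
    "clay_content": 6,
    "soil_volume_weight": 7,
    "organic_carbon_content": 8,
}

def set_proportion(data_type):
    """Return the claim-2 set proportion for a classified-data type."""
    return SET_PROPORTION[data_type]
```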
3. The method of claim 1, wherein in step 4, performing data analysis on each piece of scattered data to obtain the data characteristics of all pieces of scattered data comprises: normalizing and averaging the scattered data to generate first feature dispersion data, wherein n groups of normalization layers are used for the normalization and n-1 layers are used for the averaging; performing a first group of expansion normalization processing on the first feature dispersion data to generate second feature dispersion data, and performing averaging processing on the first feature dispersion data to generate third feature dispersion data; splicing the third feature dispersion data and the second feature dispersion data to generate fourth feature dispersion data; and taking the fourth feature dispersion data as the data characteristics of the scattered data.
4. The method of claim 3, wherein the third feature dispersion data is the same length as the second feature dispersion data.
5. The method of claim 4 wherein in the step of performing a first set of dilation normalization processes on the first feature dispersion data to generate second feature dispersion data, the number of dilation normalization processes of the first set of dilation normalization processes is three.
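A minimal sketch of the claim 3-5 feature pipeline, under loose assumptions not stated in the claims: "normalization" is taken as min-max scaling, "averaging" as pairwise averaging of neighbours, and each of the three "expansion normalization" passes as inserting midpoints followed by renormalization; the third feature is padded by repetition so that, as claim 4 requires, it matches the second feature's length.

```python
def normalize(xs):
    """Min-max scale to [0, 1] (one 'normalization layer', assumed)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def pairwise_average(xs):
    """Average neighbouring values ('averaging', assumed)."""
    return [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]

def dilate(xs):
    """Insert midpoints between neighbours (one 'expansion' pass, assumed)."""
    out = []
    for a, b in zip(xs, xs[1:]):
        out += [a, (a + b) / 2.0]
    return out + xs[-1:]

def data_features(xs, dilations=3):
    """Claims 3-5: build the first through fourth feature dispersion data."""
    first = pairwise_average(normalize(xs))     # first feature dispersion data
    second = first
    for _ in range(dilations):                  # three expansion-normalization
        second = normalize(dilate(second))      # passes (claim 5)
    third = pairwise_average(first)             # third feature dispersion data,
    third += third[-1:] * (len(second) - len(third))  # padded to match (claim 4)
    return third + second                       # fourth feature: the splice
```

For a five-value input the first feature has length 4, three midpoint dilations grow the second feature to 25, and the padded third feature matches it, so the spliced fourth feature has length 50.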
6. The method of claim 5, wherein the step 6 of correcting the combined dispersion data using a predetermined correction model to obtain the final cleaned soil data comprises: based on the obtained ambient temperature, ambient humidity and ambient illumination intensity, a preset correction value model is used for calculating to obtain a correction value, and the correction value is multiplied by each piece of data in the combined scattered data to obtain final cleaned soil data.
7. The method of claim 6, wherein the correction value model is expressed using a formula (rendered as an image in the source and not reproduced here) whose variables are the ambient temperature, the ambient humidity, and the ambient light intensity.
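Since the correction-value formula itself is not reproduced, the sketch below uses a purely hypothetical multiplicative model (all coefficients and reference conditions are invented for illustration) to show the claim 6 mechanics: compute one correction value from temperature, humidity and illumination, then multiply it into every value of the combined scattered data.

```python
def correction_value(temp_c, humidity, light_lux):
    """Hypothetical stand-in for the patent's correction-value model (the
    actual formula is not reproduced in the source). Coefficients and the
    25 degC / 50% RH / 10 klx reference point are assumptions of this sketch."""
    return (1.0
            + 0.001 * (temp_c - 25.0)         # temperature term (assumed)
            - 0.0005 * (humidity - 50.0)      # humidity term (assumed)
            + 1e-6 * (light_lux - 10000.0))   # illumination term (assumed)

def correct(combined, temp_c, humidity, light_lux):
    """Claim 6: multiply every value of the combined scattered data by the
    correction value to obtain the final cleaned soil data."""
    k = correction_value(temp_c, humidity, light_lux)
    return [x * k for x in combined]
```

At the assumed reference conditions the correction value is exactly 1.0, leaving the combined data unchanged.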
8. Data cleaning apparatus for use in soil big data analysis for performing the method of any one of claims 1 to 7.
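The sphere-turning step (step 5 of claim 1) can be sketched geometrically, under assumptions the claim does not fix: the cube circumscribes the sphere (half-width equal to the radius), "turning" is rotation about the z-axis in whole-degree steps, and a point is "in contact" when one of its coordinates reaches a cube face.

```python
import math

def rotate_z(p, angle):
    """Rotate a 3-D point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)

def turn_and_record(points, radius, steps=360, tol=1e-9):
    """Turn the sphere step by step; record the index of every point that
    touches a face of the enclosing cube (half-width = radius)."""
    recorded = set()
    for k in range(steps):
        angle = 2.0 * math.pi * k / steps
        for i, p in enumerate(points):
            x, y, z = rotate_z(p, angle)
            if max(abs(x), abs(y), abs(z)) >= radius - tol:
                recorded.add(i)
    return recorded

points = [(1.0, 0.0, 0.0),                # equatorial: sweeps past the faces
          (0.0, 0.0, 1.0),                # polar: always touches the top face
          (0.5, 0.5, math.sqrt(0.5))]     # mid-latitude: never reaches a face
kept = turn_and_record(points, radius=1.0)  # -> {0, 1}
```

Points that never touch a face (index 2 here) are the ones step 6 would send to denoising; the contacted points are retained as-is.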
CN202210067946.4A 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis Active CN114443635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210067946.4A CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210067946.4A CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Publications (2)

Publication Number Publication Date
CN114443635A CN114443635A (en) 2022-05-06
CN114443635B true CN114443635B (en) 2024-04-09

Family

ID=81368576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210067946.4A Active CN114443635B (en) 2022-01-20 2022-01-20 Data cleaning method and device in soil big data analysis

Country Status (1)

Country Link
CN (1) CN114443635B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004053659A2 (en) * 2002-12-10 2004-06-24 Stone Investments, Inc Method and system for analyzing data and creating predictive models
CN107741990A (en) * 2017-11-01 2018-02-27 深圳汇生通科技股份有限公司 Data cleansing integration method and system
CN109739850A (en) * 2019-01-11 2019-05-10 安徽爱吉泰克科技有限公司 A kind of archives big data intellectual analysis cleaning digging system
CN112733417A (en) * 2020-11-16 2021-04-30 南京邮电大学 Abnormal load data detection and correction method and system based on model optimization
WO2021139249A1 (en) * 2020-05-28 2021-07-15 平安科技(深圳)有限公司 Data anomaly detection method, apparatus and device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Application of data cleaning in medical big data analysis; Mao Yunpeng; Long Hu; Deng Ren; Guo Xin; China Digital Medicine; 2017-06-15 (06); full text *
Research and application of marketing data cleaning and governance methods; Liang Weining; Zhou Yushu; Tang Wenbin; Liu Sen; Chen Lingna; Information Technology and Informatization; 2020-07-28 (07); full text *

Also Published As

Publication number Publication date
CN114443635A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
Lee et al. Intelliclean: a knowledge-based intelligent data cleaner
Low et al. A knowledge-based approach for duplicate elimination in data cleaning
CN109935336B (en) Intelligent auxiliary diagnosis system for respiratory diseases of children
Maletic et al. Data cleansing
Chien et al. A system for online detection and classification of wafer bin map defect patterns for manufacturing intelligence
CN111222458A (en) Rolling bearing fault diagnosis method based on ensemble empirical mode decomposition and convolutional neural network
CN107741990B (en) Data cleaning integration method and system
Wang et al. Defect pattern recognition on wafers using convolutional neural networks
CN110389950B (en) Rapid running big data cleaning method
CN108038211A (en) A kind of unsupervised relation data method for detecting abnormality based on context
CN112597238A (en) Method, system, device and medium for establishing knowledge graph based on personnel information
Shi et al. An improved agglomerative hierarchical clustering anomaly detection method for scientific data
CN114443635B (en) Data cleaning method and device in soil big data analysis
Natarajan et al. Data mining techniques for data cleaning
Huang et al. Importance of data quality in virtual metrology
CN116862109A (en) Regional carbon emission situation awareness early warning method
CN116028847A (en) Universal method and system for automatic intelligent diagnosis of turbine mechanical faults
Pahwa et al. An efficient algorithm for data cleaning
CN115756919A (en) Root cause positioning method and system for multidimensional data
CN114398942A (en) Personal income tax abnormity detection method and device based on integration
Shilpika et al. Toward an in-depth analysis of multifidelity high performance computing systems
Bashir et al. Matlab-based graphical user interface for IOT sensor measurements subject to outlier
CN112559499A (en) Data mining system and method
Wang et al. Decision tree classification algorithm for non-equilibrium data set based on random forests
Zhao An empirical study of data mining in performance evaluation of HRM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant