CN112800049A - EXCEL data source cleaning method and system based on big data, electronic device and storage medium - Google Patents

EXCEL data source cleaning method and system based on big data, electronic device and storage medium

Info

Publication number
CN112800049A
Authority
CN
China
Prior art keywords
data
excel
data source
standard
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110364627.5A
Other languages
Chinese (zh)
Other versions
CN112800049B (en)
Inventor
孙东祥
常卫涛
张坤
郑媛媛
王茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Shenzhou Wisdom System Technology Co ltd
Original Assignee
Aerospace Shenzhou Wisdom System Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Shenzhou Wisdom System Technology Co ltd
Priority to CN202110364627.5A
Publication of CN112800049A
Application granted
Publication of CN112800049B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Abstract

The invention relates to a method, a system, an electronic device and a storage medium for cleaning an EXCEL data source based on big data. The method comprises the following steps: parsing and structuring an EXCEL data source; standardizing the key attribute names of the data in the parsed and structured EXCEL data source; cleaning the standardized EXCEL data source; and matching the cleaned EXCEL data source against a standard database and completing the data information. The technical scheme of the invention can effectively improve the accuracy of data processing, reduce the user's workload, and provide reliable data for the subsequent analysis and use of big data.

Description

EXCEL data source cleaning method and system based on big data, electronic device and storage medium
Technical Field
The invention relates to the technical field of data cleaning, in particular to a method and a system for cleaning an EXCEL data source based on big data, electronic equipment and a storage medium.
Background
The construction of smart cities needs the support of big data technology. The current big data field focuses mainly on mining, analyzing and using data, while the work of standardizing data and ensuring its accuracy is left to users, which imposes a heavy workload on them. Moreover, even after users spend considerable time and effort, manually collated data is not necessarily accurate.
Every industry holds large amounts of data of different types, and this data has various quality problems that seriously hinder its accurate use. To remove this obstacle, the data must be cleaned to obtain accurate, high-quality data.
Data in each industry is stored mainly in EXCEL files and various databases, and the stored structures vary widely. Cleaning such data requires manually sorting through data of different structures and types, which wastes labor.
Most data in EXCEL files is of poor quality and reliability. This impairs the analysis and discovery of the information in the data and gives decision makers misleading references.
Disclosure of Invention
The present invention is directed to solving at least one of the above problems in the background art, and provides a method, a system, an electronic device and a storage medium for cleaning EXCEL data source based on big data.
In order to achieve the above object, the present invention provides a method for cleaning EXCEL data source based on big data, including:
analyzing and structuring an EXCEL data source;
carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;
cleaning the standardized EXCEL data source;
and performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.
According to one aspect of the invention, an EXCEL data source is parsed and structured comprising:
uploading an EXCEL data source, and designating the number of lines of a title in the data source;
distinguishing a header line and a data area according to the header line number;
automatically constructing a data model according to the last line of the title, and defining the name of a corresponding field;
establishing a corresponding relation between the field and the title;
and storing the data of the EXCEL data source into a database.
According to one aspect of the invention, the normalization of the key attribute names of the data in the analyzed and structured EXCEL data source is to match the key field data in the EXCEL data source with the standard data.
According to one aspect of the present invention, the cleaning of the normalized EXCEL data source comprises:
preprocessing data in the EXCEL data source;
establishing a knowledge base model, comparing the data in the preprocessed EXCEL data source with the non-standard data stored in the knowledge base model, and if the data in the EXCEL data source is equal to the non-standard data stored in the knowledge base model, determining that the data in the EXCEL data source is the corresponding standard data;
and constructing a standard library provided with standard data, deeply cleaning the data in the EXCEL data source, confirming the data similar to the standard data in the standard library, and replacing the data with the standard data in the standard library.
According to one aspect of the invention, the pre-processing comprises:
removing the leading and trailing spaces from the data by using the JAVA method for trimming leading and trailing spaces;
removing all remaining spaces in the character string by replacing each space with an empty string using the JAVA character replacement method;
converting the lower case letters in the data into upper case letters by using the JAVA method for converting lower case to upper case;
and checking the mobile phone number by using a regular expression.
According to one aspect of the present invention, the corresponding data in the standard library is found according to the key fields in the EXCEL data source by using a cosine value algorithm, wherein the cosine value algorithm is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
in the formula: x and y represent two vectors, i represents a dimension of the vectors, x_i represents the coordinate of vector x in the i-th dimension, y_i represents the coordinate of vector y in the i-th dimension, θ represents the included angle between vector x and vector y, and n represents the number of dimensions of vectors x and y;
the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are; when the included angle equals 0, the two vectors have the same direction and are regarded as equal.
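As a small worked illustration of the cosine value (the two vectors are chosen here purely for illustration and do not come from the invention), take x = (1, 0) and y = (1, 1):

```latex
\cos\theta
  = \frac{\sum_{i=1}^{2} x_i y_i}{\sqrt{\sum_{i=1}^{2} x_i^{2}}\,\sqrt{\sum_{i=1}^{2} y_i^{2}}}
  = \frac{1\cdot 1 + 0\cdot 1}{\sqrt{1^{2}+0^{2}}\,\sqrt{1^{2}+1^{2}}}
  = \frac{1}{\sqrt{2}} \approx 0.7071,
\qquad \theta = 45^{\circ}
```

Scaling y to (2, 2) leaves the value unchanged, because the cosine depends only on the direction of the vectors and not on their length.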
According to one aspect of the invention, standard library data corresponding to data in the EXCEL data source is listed, standard data matching the data in the EXCEL data source is validated, and after validation, the data in the EXCEL data source is directly replaced with the data of the standard library using the update method of sql.
To achieve the above object, the present invention further provides a complex EXCEL data source cleaning system, including:
the data analysis module is used for analyzing and structuring the EXCEL data source;
the standardization processing module is used for carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;
the data cleaning module is used for cleaning the standardized EXCEL data source;
and the data standard matching module is used for performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.
To achieve the above object, the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the computer program implements the above method when executed by the processor.
To achieve the above object, the present invention also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the above method.
According to the technical scheme of the invention, the accuracy of data processing can be effectively improved, the workload of a user is relieved, and data guarantee is provided for the analysis and the use of the following big data. The accurate raw data facilitates the analysis and mining of accurate data information, thereby providing a more accurate reference for corresponding decisions.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the drawings are not to scale unless otherwise specified.
FIG. 1 is a flow chart that schematically illustrates a method for cleaning a big data based EXCEL data source in accordance with the present invention;
FIG. 2 schematically represents a block diagram of a big data based EXCEL data source cleansing system in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of the present invention.
FIG. 1 schematically represents a flow chart of a big data based EXCEL data source cleansing method in accordance with the present invention. As shown in FIG. 1, the EXCEL data source cleaning method based on big data according to the invention comprises the following steps:
a. analyzing and structuring an EXCEL data source;
b. carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;
c. cleaning the standardized EXCEL data source;
d. and performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.
According to one embodiment of the present invention, parsing and structuring an EXCEL data source comprises:
uploading an EXCEL data source, and designating the number of lines of a title in a list;
distinguishing a header line and a data area according to the header line number;
automatically constructing a data model according to the last line of the title, and defining the name of a corresponding field;
establishing a corresponding relation between the field and the title;
and storing the data of the EXCEL data source into a database.
Further, the data in the analyzed and structured EXCEL data source is subjected to the standardization of key attribute names, namely, the matching of the key field data in the EXCEL data source and the standard data is carried out.
Further, cleaning the normalized EXCEL data source includes:
preprocessing data in the EXCEL data source;
establishing a knowledge base model, comparing the data in the preprocessed EXCEL data source with the non-standard data stored in the knowledge base model, and if the data in the EXCEL data source is equal to the non-standard data stored in the knowledge base model, determining that the data in the EXCEL data source is the corresponding standard data;
and constructing a standard library provided with standard data, deeply cleaning the data in the EXCEL data source, confirming the data similar to the standard data in the standard library, and replacing the data with the standard data in the standard library.
Further, the pre-processing comprises:
removing the leading and trailing spaces from the data by using the JAVA method for trimming leading and trailing spaces;
removing all remaining spaces in the character string by replacing each space with an empty string using the JAVA character replacement method;
converting the lower case letters in the data into upper case letters by using the JAVA method for converting lower case to upper case;
and checking the mobile phone number by using a regular expression.
Using a cosine value algorithm to find corresponding data in a standard library according to key fields in an EXCEL data source, wherein the cosine value algorithm is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
in the formula: x and y represent two vectors, i represents a dimension of the vectors, x_i represents the coordinate of vector x in the i-th dimension, y_i represents the coordinate of vector y in the i-th dimension, θ represents the included angle between vector x and vector y, and n represents the number of dimensions of vectors x and y;
the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are; when the included angle equals 0, the two vectors have the same direction and are regarded as equal.
Listing standard library data corresponding to the data in the EXCEL data source, confirming the standard data matched with the data in the EXCEL data source, and directly replacing the data in the EXCEL data source with the data of the standard library by using the update method of sql after confirmation.
According to the scheme of the invention, the accuracy of data processing can be effectively improved, the workload of a user is relieved, and data guarantee is provided for the analysis and the use of the subsequent big data.
In order to achieve the above object, the present invention further provides a big data based EXCEL data source cleaning system, a block diagram of which is shown in fig. 2, the system comprising:
the data analysis module is used for analyzing and structuring the EXCEL data source;
the standardization processing module is used for carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;
the data cleaning module is used for cleaning the standardized EXCEL data source;
and the data standard matching module is used for performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.
According to one embodiment of the present invention, a data parsing module parses and structures an EXCEL data source, comprising:
uploading an EXCEL data source, and designating the number of lines of a title in a list;
distinguishing a header line and a data area according to the header line number;
automatically constructing a data model according to the last line of the title, and defining the name of a corresponding field;
establishing a corresponding relation between the field and the title;
and storing the data of the EXCEL data source into a database.
Further, the normalization processing module normalizes the key attribute name of the data in the analyzed and structured EXCEL data source to match the key field data in the EXCEL data source with the standard data.
Further, the data cleansing module cleanses the normalized EXCEL data source, including:
preprocessing data in the EXCEL data source;
establishing a knowledge base model, comparing the data in the preprocessed EXCEL data source with the non-standard data stored in the knowledge base model, and if the data in the EXCEL data source is equal to the non-standard data stored in the knowledge base model, determining that the data in the EXCEL data source is the corresponding standard data;
and constructing a standard library provided with standard data, deeply cleaning the data in the EXCEL data source, confirming the data similar to the standard data in the standard library, and replacing the data with the standard data in the standard library.
Further, the pre-processing comprises:
removing the leading and trailing spaces from the data by using the JAVA method for trimming leading and trailing spaces;
removing all remaining spaces in the character string by replacing each space with an empty string using the JAVA character replacement method;
converting the lower case letters in the data into upper case letters by using the JAVA method for converting lower case to upper case;
and checking the mobile phone number by using a regular expression.
Using a cosine value algorithm to find corresponding data in a standard library according to key fields in an EXCEL data source, wherein the cosine value algorithm is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
in the formula: x and y represent two vectors, i represents a dimension of the vectors, x_i represents the coordinate of vector x in the i-th dimension, y_i represents the coordinate of vector y in the i-th dimension, θ represents the included angle between vector x and vector y, and n represents the number of dimensions of vectors x and y;
the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are; when the included angle equals 0, the two vectors have the same direction and are regarded as equal.
Listing standard library data corresponding to the data in the EXCEL data source, confirming the standard data matched with the data in the EXCEL data source, and directly replacing the data in the EXCEL data source with the data of the standard library by using the update method of sql after confirmation.
According to the scheme of the invention, the accuracy of data processing can be effectively improved, the workload of a user is relieved, and data guarantee is provided for the analysis and the use of the subsequent big data.
To achieve the above object, the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the computer program implements the above method when executed by the processor.
To achieve the above object, the present invention further provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the above method when executed by a processor.
The invention will be described in detail below with reference to the accompanying drawings by way of an example.
Example 1
Input: an EXCEL list containing non-standard data, together with the specified number of title rows;
Output: an EXCEL list containing standardized data;
The processing procedure is as follows:
the title and the data are distinguished according to the number of EXCEL title rows, titleNum: rows 1 to titleNum are the title area, and rows (titleNum + 1) to the last row are the data area;
the data in the title area and the data area of the EXCEL list is parsed using JAVA POI:
checking the suffix of the EXCEL file to determine whether it is 'XLSX' or 'XLS';
creating the corresponding workbook according to the suffix;
parsing the first sheet in the workbook;
looping over each row of data in the sheet;
looping over each cell in each row;
and reading the data in each cell and storing it in memory.
The titles that have been read are stored in the T_DATA_SOURCE_COLUMN table using JDBC. A corresponding table structure is created from the data read from the title area, and its fields are named one by one according to the titles (STR1, STR2, STR3, ...).
A data source table and a data cleaning table are built from the titles using data modeling techniques, with the corresponding field names defined (STR1, STR2, STR3, ...). The data of the data area is then stored in the data source table and the data cleaning table.
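The title/data split and the field naming described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the rows are assumed to be already read into memory (the POI parsing is omitted), and the class name `HeaderModelSketch`, its method names, and the `VARCHAR(255)` column type are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: names and column type are illustrative assumptions.
final class HeaderModelSketch {

    // Rows 1..titleNum (1-based) form the title area.
    static List<List<String>> titleRows(List<List<String>> rows, int titleNum) {
        checkTitleNum(rows, titleNum);
        return new ArrayList<>(rows.subList(0, titleNum));
    }

    // Rows titleNum+1..last form the data area.
    static List<List<String>> dataRows(List<List<String>> rows, int titleNum) {
        checkTitleNum(rows, titleNum);
        return new ArrayList<>(rows.subList(titleNum, rows.size()));
    }

    // Maps generated field names STR1, STR2, ... to the titles in the last title row.
    static Map<String, String> fieldToTitle(List<List<String>> rows, int titleNum) {
        checkTitleNum(rows, titleNum);
        List<String> lastTitleRow = rows.get(titleNum - 1);
        Map<String, String> mapping = new LinkedHashMap<>();
        for (int i = 0; i < lastTitleRow.size(); i++) {
            mapping.put("STR" + (i + 1), lastTitleRow.get(i));
        }
        return mapping;
    }

    // Builds a CREATE TABLE statement with one text column per generated field.
    static String createTableSql(String tableName, Map<String, String> mapping) {
        StringBuilder sql = new StringBuilder("CREATE TABLE " + tableName + " (");
        int i = 0;
        for (String field : mapping.keySet()) {
            if (i++ > 0) {
                sql.append(", ");
            }
            sql.append(field).append(" VARCHAR(255)");
        }
        return sql.append(")").toString();
    }

    private static void checkTitleNum(List<List<String>> rows, int titleNum) {
        if (titleNum < 1 || titleNum > rows.size()) {
            throw new IllegalArgumentException("titleNum must be between 1 and the row count");
        }
    }
}
```

With a two-row title area whose last row is ("Name", "Phone"), `fieldToTitle` yields STR1 for "Name" and STR2 for "Phone", which is the field-to-title correspondence the text describes.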
Preprocessing (removing leading and trailing spaces, removing all spaces, converting lower case to upper case, and checking the mobile phone number), implemented with JAVA methods:
removing the leading and trailing spaces from the data by using the JAVA method for trimming leading and trailing spaces;
removing all remaining spaces in the character string by replacing each space " " with the empty string "" using the JAVA character replacement method;
converting the lower case letters in the data into upper case letters by using the JAVA method for converting lower case to upper case;
checking the mobile phone number by using the regular expression "^((13[0-9])|(15[^4,\\D])|(18[0,5-9]))\\d{8}$";
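The four preprocessing steps can be sketched with standard `String` and `Pattern` methods. This is a minimal sketch: the text does not name the exact JAVA methods, so `trim`, `replace` and `toUpperCase` are assumptions, the class name `PreprocessSketch` is hypothetical, and the pattern is the regular expression from the text with the extraction spacing removed, which is itself an assumption about the intended pattern.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the preprocessing steps.
final class PreprocessSketch {

    // Assumed form of the mobile number pattern, written as a Java string literal.
    static final Pattern MOBILE =
        Pattern.compile("^((13[0-9])|(15[^4,\\D])|(18[0,5-9]))\\d{8}$");

    static String normalize(String raw) {
        // trim() strips leading and trailing characters up to U+0020, including spaces.
        String s = raw.trim();
        // Replace every remaining space with the empty string.
        s = s.replace(" ", "");
        // Convert lower case letters to upper case, independent of the default locale.
        return s.toUpperCase(Locale.ROOT);
    }

    static boolean isMobile(String value) {
        return MOBILE.matcher(value).matches();
    }
}
```

For example, `PreprocessSketch.normalize("  ab c ")` returns `"ABC"`, and `PreprocessSketch.isMobile("13812345678")` returns `true`, while a number starting with 154 is rejected by the `15[^4,\\D]` branch.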
Cleaning by using the knowledge base:
a knowledge base model is constructed and the list data is compared with the non-standard data in the knowledge base; if the list data equals a non-standard entry in the knowledge base, the data is changed into the corresponding standard data by using an sql update statement.
The knowledge base stores non-standard data and the corresponding standard data.
The model structure is as follows: T_CORE_FIELD (table of fields to be standardized), T_CORE_FIELD_STD (standard data table), T_CORE_FIELD_NO_STD (non-standard data cross-reference table)
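The knowledge base comparison amounts to an exact-match lookup from a non-standard value to its standard value. The sketch below holds that mapping in memory for illustration only; in the text it lives in database tables and the replacement is done with an sql update. The class name `KnowledgeBaseSketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for the non-standard to standard mapping.
final class KnowledgeBaseSketch {

    private final Map<String, String> nonStandardToStandard = new HashMap<>();

    // Records that a non-standard value corresponds to a standard value.
    void register(String nonStandard, String standard) {
        nonStandardToStandard.put(nonStandard, standard);
    }

    // Returns the standard value when the input equals a known non-standard value.
    Optional<String> lookup(String value) {
        return Optional.ofNullable(nonStandardToStandard.get(value));
    }

    // Returns the cleaned value; unknown values pass through unchanged
    // so that a later deep-cleaning step can handle them.
    String clean(String value) {
        return lookup(value).orElse(value);
    }
}
```

Because the comparison is an equality test, values that differ from every stored non-standard entry are left untouched here, which is why the text follows this step with similarity-based deep cleaning.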
Deep cleaning and manual confirmation:
first, a standard library (std_lib) is constructed, in which a set of standard data is stored.
Similar data in the standard library is found using a cosine value algorithm based on the key fields in the list. It is then manually confirmed which standard entry the similar data corresponds to, and finally the data in the list is replaced with the standard data by using an sql update statement.
Cosine value algorithm:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
in the formula: x and y represent two vectors, i represents a dimension of the vectors, x_i represents the coordinate of vector x in the i-th dimension, y_i represents the coordinate of vector y in the i-th dimension, θ represents the included angle between vector x and vector y, and n represents the number of dimensions of vectors x and y;
the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are; when the included angle equals 0, the two vectors have the same direction and are regarded as equal. This measure is called "cosine similarity".
Example (the names are shown in Chinese because the comparison is made character by character):
Name A: 新西兰红玫瑰Queen苹果12个140g以上/个 (New Zealand Red Rose Queen apples, 12 pieces, 140 g or more each)
Name B: 新西兰红玫瑰Queen苹果6个150g以上/个 (New Zealand Red Rose Queen apples, 6 pieces, 150 g or more each)
First step, split each name into single characters:
Name A: 新 西 兰 红 玫 瑰 Q u e e n 苹 果 1 2 个 1 4 0 g 以 上 / 个
Name B: 新 西 兰 红 玫 瑰 Q u e e n 苹 果 6 个 1 5 0 g 以 上 / 个
Second step, list all distinct characters of the two names (deduplicated):
新 西 兰 红 玫 瑰 Q u e n 苹 果 1 2 个 4 0 g 以 上 / 6 5
Third step, count the frequency of each character:
Name A: 新[1] 西[1] 兰[1] 红[1] 玫[1] 瑰[1] Q[1] u[1] e[2] n[1] 苹[1] 果[1] 1[2] 2[1] 个[2] 4[1] 0[1] g[1] 以[1] 上[1] /[1] 6[0] 5[0]
Name B: 新[1] 西[1] 兰[1] 红[1] 玫[1] 瑰[1] Q[1] u[1] e[2] n[1] 苹[1] 果[1] 1[1] 2[0] 个[2] 4[0] 0[1] g[1] 以[1] 上[1] /[1] 6[1] 5[1]
Fourth step, write out the word-frequency vectors:
Name A: (1,1,1,1,1,1,1,1,2,1,1,1,2,1,2,1,1,1,1,1,1,0,0)
Name B: (1,1,1,1,1,1,1,1,2,1,1,1,1,0,2,0,1,1,1,1,1,1,1)
Fifth step, apply the formula:
Value = (A1 × B1 + A2 × B2 + A3 × B3 + ... + An × Bn) / (√(A1² + A2² + ... + An²) × √(B1² + B2² + ... + Bn²))
Value = 26 / (√30 × √27)
Value = 26 / 28.4605
Value ≈ 0.9135
The larger the value, the more similar the two names are: a value of 1 means they are exactly the same, and a value of 0 means they are completely different.
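The calculation above can be reproduced with a short sketch. The class name `CosineSketch` and its methods are hypothetical, and building the vectors from single UTF-16 characters is an assumption that mirrors the character-by-character splitting in the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper computing cosine similarity of character-frequency vectors.
final class CosineSketch {

    // Cosine of the angle between two frequency vectors of equal dimension.
    static double cosine(int[] x, int[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += (double) x[i] * y[i];
            normX += (double) x[i] * x[i];
            normY += (double) y[i] * y[i];
        }
        if (normX == 0 || normY == 0) {
            return 0.0; // an all-zero vector has no direction; treated as dissimilar
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    // Builds character-frequency vectors over the shared vocabulary, then compares them.
    static double similarity(String a, String b) {
        Set<Character> vocabulary = new LinkedHashSet<>();
        for (char c : (a + b).toCharArray()) {
            vocabulary.add(c);
        }
        int[] x = new int[vocabulary.size()];
        int[] y = new int[vocabulary.size()];
        int i = 0;
        for (Character c : vocabulary) {
            x[i] = count(a, c);
            y[i] = count(b, c);
            i++;
        }
        return cosine(x, y);
    }

    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }
}
```

Passing the two word-frequency vectors from the fourth step to `cosine` gives 26 / (√30 × √27), which is approximately 0.9135, in agreement with the value computed by hand.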
Standard library data similar to the list data is listed, it is manually confirmed which entry matches the list data, and after confirmation the list data is directly replaced with the standard library data by using an sql update statement.
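The replacement after manual confirmation can be sketched with a parameterized JDBC update. This is only an illustration under stated assumptions: the text does not give the table or column involved, so the table and column names passed in are hypothetical, as are the class name `ReplaceSketch` and its methods.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper: replaces a confirmed non-standard value with its standard value.
final class ReplaceSketch {

    private static final String IDENTIFIER = "[A-Za-z_][A-Za-z0-9_]*";

    // Builds the UPDATE statement. Identifiers cannot be bound as parameters,
    // so they are restricted to plain names before being concatenated.
    static String updateSql(String table, String column) {
        if (!table.matches(IDENTIFIER) || !column.matches(IDENTIFIER)) {
            throw new IllegalArgumentException("invalid table or column name");
        }
        return "UPDATE " + table + " SET " + column + " = ? WHERE " + column + " = ?";
    }

    // Executes the update and returns the number of rows changed.
    static int replace(Connection conn, String table, String column,
                       String oldValue, String standardValue) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(updateSql(table, column))) {
            ps.setString(1, standardValue);
            ps.setString(2, oldValue);
            return ps.executeUpdate();
        }
    }
}
```

Binding the old and new values as parameters keeps product names that contain quotes or slashes from breaking the statement.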
The cleaned data list is exported using JAVA POI.
Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims (10)

1. A method for cleaning EXCEL data source based on big data is characterized by comprising the following steps:
analyzing and structuring an EXCEL data source;
carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;
cleaning the standardized EXCEL data source;
and performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.
2. The EXCEL data source cleaning method based on big data according to claim 1, parsing and structuring EXCEL data source comprising:
uploading an EXCEL data source, and designating the number of lines of a title in the data source;
distinguishing a header line and a data area according to the header line number;
automatically constructing a data model according to the last line of the title, and defining the name of a corresponding field;
establishing a corresponding relation between the field and the title;
and storing the data of the EXCEL data source into a database.
3. The EXCEL data source cleaning method based on big data according to claim 1, characterized in that the normalization process of key attribute names to the data in the resolved and structured EXCEL data source is to match the key field data in the EXCEL data source with the standard data.
4. The EXCEL data source cleaning method based on big data according to claim 1, characterized in that cleaning the normalized EXCEL data source comprises:
preprocessing data in the EXCEL data source;
establishing a knowledge base model, comparing the data in the preprocessed EXCEL data source with the non-standard data stored in the knowledge base model, and if the data in the EXCEL data source is equal to the non-standard data stored in the knowledge base model, determining that the data in the EXCEL data source is the corresponding standard data;
and constructing a standard library provided with standard data, deeply cleaning the data in the EXCEL data source, confirming the data similar to the standard data in the standard library, and replacing the data with the standard data in the standard library.
5. The EXCEL data source cleaning method based on big data according to claim 4, characterized in that the preprocessing comprises:
removing the leading and trailing spaces from the data by using the JAVA method for trimming leading and trailing spaces;
removing all remaining spaces in the character string by replacing each space with an empty string using the JAVA character replacement method;
converting the lower case letters in the data into upper case letters by using the JAVA method for converting lower case to upper case;
and checking the mobile phone number by using a regular expression.
6. The EXCEL data source cleaning method based on big data according to claim 5, characterized in that the corresponding data in the standard library is found from the key fields in the EXCEL data source using cosine value algorithm, wherein the cosine value algorithm is:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
in the formula: x and y represent two vectors, i represents a dimension of the vectors, x_i represents the coordinate of vector x in the i-th dimension, y_i represents the coordinate of vector y in the i-th dimension, θ represents the included angle between vector x and vector y, and n represents the number of dimensions of vectors x and y;
the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are; when the included angle equals 0, the two vectors have the same direction and are regarded as equal.
7. The EXCEL data source cleaning method based on big data according to claim 6, characterized in that the standard library data corresponding to the data in EXCEL data source is listed, the standard data matching the data in EXCEL data source is confirmed, after confirmation, the data in EXCEL data source is directly replaced with the data of standard library using sql update method.
8. An EXCEL data source cleaning system based on big data, comprising:
the data analysis module is used for analyzing and structuring the EXCEL data source;
the standardization processing module is used for carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;
the data cleaning module is used for cleaning the standardized EXCEL data source;
and the data standard matching module is used for performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.
9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110364627.5A 2021-04-06 2021-04-06 EXCEL data source cleaning method and system based on big data, electronic device and storage medium Active CN112800049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110364627.5A CN112800049B (en) 2021-04-06 2021-04-06 EXCEL data source cleaning method and system based on big data, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110364627.5A CN112800049B (en) 2021-04-06 2021-04-06 EXCEL data source cleaning method and system based on big data, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN112800049A (en) 2021-05-14
CN112800049B (en) 2021-08-03

Family

ID=75816300

Family Applications (1)

Application Number Priority Date Filing Date Title
CN202110364627.5A Active CN112800049B (en) 2021-04-06 2021-04-06 EXCEL data source cleaning method and system based on big data, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN112800049B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731791A (en) * 2013-12-18 2015-06-24 东阳艾维德广告传媒有限公司 Marketing analysis data market system
CN107562701A (en) * 2017-08-22 2018-01-09 上海找钢网信息科技股份有限公司 A kind of data analysis method and its system of steel trade industry stock resource
US10114884B1 (en) * 2015-12-16 2018-10-30 Palantir Technologies Inc. Systems and methods for attribute analysis of one or more databases
CN109241397A (en) * 2018-07-06 2019-01-18 四川斐讯信息技术有限公司 A kind of method and apparatus for cleaning data
CN109977110A (en) * 2019-04-28 2019-07-05 杭州数梦工场科技有限公司 Data cleaning method, device and equipment
CN110389950A (en) * 2019-07-31 2019-10-29 南京安夏电子科技有限公司 A kind of big data cleaning method quickly run
CN111125076A (en) * 2019-12-17 2020-05-08 武汉海云健康科技股份有限公司 Big data based medicine universal name cleaning method and system, server and medium
CN111597292A (en) * 2020-04-20 2020-08-28 安徽慧医信息科技有限公司 Text formatting cleaning method based on webpage label position
CN111639066A (en) * 2020-05-14 2020-09-08 杭州数梦工场科技有限公司 Data cleaning method and device
CN111858567A (en) * 2020-06-18 2020-10-30 南京市江宁区信息化管理服务中心 Method and system for cleaning government affair data through standard data elements
CN112181949A (en) * 2020-10-10 2021-01-05 浪潮云信息技术股份公司 Online data modeling method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
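KANSONG-CRYPTO: "使用java解析excel表格(包含表头判断)" [Parsing Excel spreadsheets with Java (including header detection)], 《HTTPS://BLOG.CSDN.NET/QQ_38932871/ARTICLE/DETAILS/109230492》 *
PLEASECALLMETEN: "POI读取EXCEL顶端标题行属性" [Reading the top title row property of an Excel sheet with POI], 《HTTPS://BLOG.CSDN.NET/KAIYUANTAO/ARTICLE/DETAILS/44461569》 *
友盟全域数据: "数据挖掘中常用的数据清洗方法有哪些?" [What data cleaning methods are commonly used in data mining?], 《HTTPS://WWW.ZHIHU.COM/QUESTION/22077960/ANSWER/1610022292》 *
郑贵福 (Zheng Guifu): "面向政府数据开放的数据清洗框架与应用研究" [Research on a data cleaning framework and its application for open government data], 《中国优秀硕士学位论文全文数据库 社会科学I辑》 [China Master's Theses Full-text Database, Social Sciences I] *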

Also Published As

Publication number Publication date
CN112800049B (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
CN110795543A (en) Unstructured data extraction method and device based on deep learning and storage medium
CN110569353A (en) Attention mechanism-based Bi-LSTM label recommendation method
CN105095444A (en) Information acquisition method and device
CN111597356B (en) Intelligent education knowledge map construction system and method
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN109739997A (en) Address control methods, apparatus and system
CN110599839A (en) Online examination method and system based on intelligent paper grouping and text analysis review
CN111339269A (en) Knowledge graph question-answer training and application service system with automatically generated template
CN111581376A (en) Automatic knowledge graph construction system and method
CN111460145A (en) Learning resource recommendation method, device and storage medium
CN112069327A (en) Knowledge graph construction method and system for teaching resources of online education classroom
CN110781681A (en) Translation model-based elementary mathematic application problem automatic solving method and system
CN111222345A (en) Place name address visualization analysis method based on semantic word segmentation technology
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN112800049B (en) EXCEL data source cleaning method and system based on big data, electronic device and storage medium
CN101114282A (en) Participle processing method and equipment
CN112231448A (en) Intelligent document question and answer method and device
CN117290404A (en) Method and system for rapidly searching and practical main distribution network fault processing method
Kelil et al. A general measure of similarity for categorical sequences
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
Connaway et al. Publisher names in bibliographic data
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
US20220156611A1 (en) Method and apparatus for entering information, electronic device, computer readable storage medium
CN113420564B (en) Hybrid matching-based electric power nameplate semantic structuring method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant