CN112800049A

CN112800049A - EXCEL data source cleaning method and system based on big data, electronic device and storage medium

Info

Publication number: CN112800049A
Application number: CN202110364627.5A
Authority: CN
Inventors: 孙东祥; 常卫涛; 张坤; 郑媛媛; 王茹
Original assignee: Aerospace Shenzhou Wisdom System Technology Co ltd
Current assignee: Aerospace Shenzhou Wisdom System Technology Co ltd
Priority date: 2021-04-06
Filing date: 2021-04-06
Publication date: 2021-05-14
Anticipated expiration: 2041-04-06
Also published as: CN112800049B

Abstract

The invention relates to a big data based EXCEL data source cleaning method and system, an electronic device and a storage medium, wherein the method comprises the following steps: parsing and structuring an EXCEL data source; standardizing the key attribute names of the data in the parsed and structured EXCEL data source; cleaning the standardized EXCEL data source; and matching the cleaned EXCEL data source against a standard database and completing the data information. According to the technical scheme of the invention, the accuracy of data processing can be effectively improved, the workload of the user is reduced, and a reliable data basis is provided for the subsequent analysis and use of big data.

Description

EXCEL data source cleaning method and system based on big data, electronic device and storage medium

Technical Field

The invention relates to the technical field of data cleaning, in particular to a method and a system for cleaning an EXCEL data source based on big data, electronic equipment and a storage medium.

Background

The construction of smart cities requires the support of big data technology. The current big data field focuses mainly on mining, analyzing and using data, while the work of making data standard and accurate is left to users, which creates a huge workload for them. Moreover, even after users spend considerable time and energy, the accuracy of manually collated data is not necessarily high.

Various industries hold large amounts of data of different types, and this data has various problems that greatly hinder its accurate use. To remove this obstacle, the data needs to be cleaned to obtain accurate, high-quality data.

Data in each industry is stored mainly in EXCEL and in various databases, and the storage structures vary widely. Cleaning such data requires manually sorting out data of different structures and types, which wastes labor.

Most data in EXCEL is poor in quality and reliability. This impairs the analysis and discovery of the information in the data and provides wrong references for decision-making.

Disclosure of Invention

The present invention is directed to solving at least one of the above problems in the background art, and provides a method, a system, an electronic device and a storage medium for cleaning EXCEL data source based on big data.

In order to achieve the above object, the present invention provides a method for cleaning EXCEL data source based on big data, including:

analyzing and structuring an EXCEL data source;

carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;

cleaning the standardized EXCEL data source;

and performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.

According to one aspect of the invention, parsing and structuring the EXCEL data source comprises:

uploading an EXCEL data source, and specifying the number of title lines in the data source;

distinguishing the title lines from the data area according to the number of title lines;

automatically constructing a data model according to the last title line, and defining the names of the corresponding fields;

establishing a corresponding relation between the fields and the titles;

and storing the data of the EXCEL data source into a database.

According to one aspect of the invention, standardizing the key attribute names of the data in the parsed and structured EXCEL data source means matching the key field data in the EXCEL data source with the standard data.

According to one aspect of the present invention, the cleaning of the normalized EXCEL data source comprises:

preprocessing data in the EXCEL data source;

establishing a knowledge base model, comparing the data in the preprocessed EXCEL data source with the non-standard data stored in the knowledge base model, and if the data in the EXCEL data source is equal to the non-standard data stored in the knowledge base model, determining that the data in the EXCEL data source is the corresponding standard data;

and constructing a standard library provided with standard data, deeply cleaning the data in the EXCEL data source, confirming the data similar to the standard data in the standard library, and replacing the data with the standard data in the standard library.

According to one aspect of the invention, the pre-processing comprises:

removing the leading and trailing spaces in the data by using the JAVA method for removing leading and trailing spaces;

replacing spaces with the empty string by using the character replacement method in JAVA, thereby removing all spaces in the character string;

converting the lower case letters in the data into upper case letters by using the JAVA method for converting lower case to upper case;

and checking the mobile phone number by using a regular expression.

According to one aspect of the present invention, the corresponding data in the standard library is found according to the key fields in the EXCEL data source by using a cosine value algorithm, wherein the cosine value algorithm is as follows:

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \cdot \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

in the formula: x and y represent two vectors, i represents the dimension index of the vectors, x_i represents the coordinate of vector x in the i-th dimension, y_i represents the coordinate of vector y in the i-th dimension, θ represents the included angle between vector x and vector y, and n represents the number of dimensions of vectors x and y;

the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are; when the included angle equals 0, the two vectors point in the same direction and are regarded as equal.

According to one aspect of the invention, standard library data corresponding to data in the EXCEL data source is listed, standard data matching the data in the EXCEL data source is validated, and after validation, the data in the EXCEL data source is directly replaced with the data of the standard library using the update method of sql.

To achieve the above object, the present invention further provides a complex EXCEL data source cleaning system, including:

the data analysis module is used for analyzing and structuring the EXCEL data source;

the standardization processing module is used for carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;

the data cleaning module is used for cleaning the standardized EXCEL data source;

and the data standard matching module is used for performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.

To achieve the above object, the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the computer program implements the above method when executed by the processor.

To achieve the above object, the present invention also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the above method.

According to the technical scheme of the invention, the accuracy of data processing can be effectively improved, the workload of a user is relieved, and data guarantee is provided for the analysis and the use of the following big data. The accurate raw data facilitates the analysis and mining of accurate data information, thereby providing a more accurate reference for corresponding decisions.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.

FIG. 1 is a flow chart that schematically illustrates a method for cleaning a big data based EXCEL data source in accordance with the present invention;

FIG. 2 schematically represents a block diagram of a big data based EXCEL data source cleansing system in accordance with the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of the present invention.

FIG. 1 schematically represents a flow chart of a big data based EXCEL data source cleansing method in accordance with the present invention. As shown in FIG. 1, the EXCEL data source cleaning method based on big data according to the invention comprises the following steps:

a. analyzing and structuring an EXCEL data source;

b. carrying out standardization processing on key attribute names on the data in the analyzed and structured EXCEL data source;

c. cleaning the standardized EXCEL data source;

d. and performing standard matching on the cleaned EXCEL data source according to a standard database and perfecting data information.

According to one embodiment of the present invention, parsing and structuring an EXCEL data source comprises:

uploading an EXCEL data source, and designating the number of lines of a title in a list;

establishing a corresponding relation between the field and the title;

and storing the data of the EXCEL data source into a database.

Further, the data in the analyzed and structured EXCEL data source is subjected to the standardization of key attribute names, namely, the matching of the key field data in the EXCEL data source and the standard data is carried out.

Further, cleaning the normalized EXCEL data source includes:

preprocessing data in the EXCEL data source;

Further, the pre-processing comprises:

and checking the mobile phone number by using a regular expression.

Using a cosine value algorithm to find corresponding data in a standard library according to key fields in an EXCEL data source, wherein the cosine value algorithm is as follows:

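$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \cdot \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$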

Listing standard library data corresponding to the data in the EXCEL data source, confirming the standard data matched with the data in the EXCEL data source, and directly replacing the data in the EXCEL data source with the data of the standard library by using the update method of sql after confirmation.

According to the scheme of the invention, the accuracy of data processing can be effectively improved, the workload of a user is relieved, and data guarantee is provided for the analysis and the use of the subsequent big data.

In order to achieve the above object, the present invention further provides a big data based EXCEL data source cleaning system, a block diagram of which is shown in fig. 2, the system comprising:

According to one embodiment of the present invention, a data parsing module parses and structures an EXCEL data source, comprising:

establishing a corresponding relation between the field and the title;

and storing the data of the EXCEL data source into a database.

Further, the normalization processing module normalizes the key attribute name of the data in the analyzed and structured EXCEL data source to match the key field data in the EXCEL data source with the standard data.

Further, the data cleansing module cleanses the normalized EXCEL data source, including:

preprocessing data in the EXCEL data source;

Further, the pre-processing comprises:

and checking the mobile phone number by using a regular expression.

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \cdot \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

in the formula: x and y represent two vectors, i represents the dimension index of the vectors, x_i represents the coordinate of vector x in the i-th dimension, y_i represents the coordinate of vector y in the i-th dimension, θ represents the included angle between vector x and vector y, and n represents the number of dimensions of vectors x and y;

To achieve the above object, the present invention further provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the above method when executed by a processor.

The invention will be described in detail below with reference to the accompanying drawings by way of an example.

Example 1

Input: an EXCEL list containing non-standard data, together with the specified number of title lines;

Output: an EXCEL list containing standardized data;

The processing steps are as follows:

the title and the data are distinguished according to the number of EXCEL title lines, titleNum: lines 1 through titleNum form the title area, and lines (titleNum + 1) through the last line form the data area;

the data of the title area and the data area of the EXCEL list is parsed by using JAVA POI:

parsing the suffix of the EXCEL file, and determining whether the suffix is 'XLSX' or 'XLS';

creating the corresponding workbook according to the suffix;

parsing the first sheet in the workbook;

iterating over each row of data in the sheet;

iterating over each cell in each row;

and reading the data in the cells and storing the data in memory.
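By way of non-limiting illustration, the suffix check that precedes workbook creation may be sketched in JAVA as follows. The class name `ExcelSuffixSketch`, the enum `ExcelFormat` and the sample file names are assumptions of this sketch and do not appear in the embodiment. The sketch uses only the JAVA standard library; with Apache POI, the two formats are commonly opened with `HSSFWorkbook` (XLS) and `XSSFWorkbook` (XLSX), but the embodiment does not state which classes it uses.

```java
import java.util.Locale;

/** Illustrative sketch only; not the implementation of the embodiment. */
class ExcelSuffixSketch {

    /** The two workbook formats named in the text. */
    enum ExcelFormat { XLS, XLSX }

    /** Returns the format indicated by the file suffix, compared case-insensitively. */
    static ExcelFormat detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file suffix: " + fileName);
        }
        String suffix = fileName.substring(dot + 1).toUpperCase(Locale.ROOT);
        switch (suffix) {
            case "XLS":
                return ExcelFormat.XLS;
            case "XLSX":
                return ExcelFormat.XLSX;
            default:
                throw new IllegalArgumentException("Not an EXCEL file: " + fileName);
        }
    }

    public static void main(String[] args) {
        // Hypothetical file names, used only to exercise the check.
        System.out.println(detect("customers.xlsx"));
        System.out.println(detect("LEGACY.XLS"));
    }
}
```

Rejecting unknown suffixes early keeps a non-EXCEL upload from reaching the row and cell loops at all.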

The read title is stored in the T_DATA_SOURCE_COLUMN table using JDBC. From the data of the read title area, a corresponding table structure is created, whose fields are named one by one in title order (STR1, STR2, STR3, ...).

Using data modeling techniques, a data source table and a data cleaning table are built from the titles, with the corresponding field names defined (STR1, STR2, STR3, ...). The data of the data area is then stored into the data source table and the data cleaning table.
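By way of non-limiting illustration, the split into title area and data area and the naming of fields may be sketched in JAVA as follows. The sketch assumes the rows have already been read into memory as lists of strings and that titleNum is between 1 and the number of rows. The class name `HeaderSplitSketch`, its method names and the sample rows are hypothetical; only titleNum and the STR1, STR2, ... naming come from the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the implementation of the embodiment. */
class HeaderSplitSketch {

    /** Lines 1..titleNum (1-based) form the title area. */
    static List<List<String>> titleRows(List<List<String>> rows, int titleNum) {
        return new ArrayList<>(rows.subList(0, titleNum));
    }

    /** Lines titleNum+1..last form the data area. */
    static List<List<String>> dataRows(List<List<String>> rows, int titleNum) {
        return new ArrayList<>(rows.subList(titleNum, rows.size()));
    }

    /** Maps generated field names (STR1, STR2, ...) to the titles of the last title line. */
    static Map<String, String> fieldToTitle(List<List<String>> rows, int titleNum) {
        List<String> lastTitleRow = rows.get(titleNum - 1);
        Map<String, String> mapping = new LinkedHashMap<>();
        for (int i = 0; i < lastTitleRow.size(); i++) {
            mapping.put("STR" + (i + 1), lastTitleRow.get(i));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Hypothetical sheet content with two title lines.
        List<List<String>> rows = List.of(
                List.of("Customer list", ""),
                List.of("Name", "Phone"),
                List.of("Alice", "13812345678"),
                List.of("Bob", "15012345678"));
        int titleNum = 2;
        System.out.println(fieldToTitle(rows, titleNum));
        System.out.println(dataRows(rows, titleNum).size());
    }
}
```

Keeping the field-to-title mapping in insertion order preserves the column order of the sheet, which is what a record of the correspondence between fields and titles would need.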

Preprocessing (removing leading and trailing spaces, removing all spaces, converting lower case to upper case, and verifying the mobile phone number) is performed by using JAVA methods:

replacing spaces with the empty string "" by using the character replacement method in JAVA, thereby removing all spaces in the character string;

checking the mobile phone number by using the regular expression "^((13[0-9])|(15[^4,\\D])|(18[0,5-9]))\\d{8}$";
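By way of non-limiting illustration, the preprocessing operations listed above may be sketched in JAVA as follows. The class name `PreprocessSketch` and its method names are hypothetical. The pattern is a reading of the regular expression quoted above with the extraction spaces removed, and it assumes that the trailing part is `\d{8}` (eight digits); this reading is an assumption of the sketch.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the implementation of the embodiment. */
class PreprocessSketch {

    // Assumed reading of the pattern in the text: prefixes 13x, 15x (except 154),
    // 180 and 185-189, followed by eight digits.
    private static final Pattern MOBILE =
            Pattern.compile("^((13[0-9])|(15[^4,\\D])|(18[0,5-9]))\\d{8}$");

    /** Trims the value, removes every remaining space, and converts to upper case. */
    static String normalize(String raw) {
        String trimmed = raw.trim();                  // remove leading and trailing spaces
        String noSpaces = trimmed.replace(" ", "");   // replace spaces with the empty string
        return noSpaces.toUpperCase(Locale.ROOT);     // lower case to upper case
    }

    /** Returns true if the value matches the mobile phone pattern above. */
    static boolean isMobileNumber(String value) {
        return MOBILE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  ab c 1 "));
        System.out.println(isMobileNumber("13812345678"));
        System.out.println(isMobileNumber("15412345678"));
    }
}
```

Note that the pattern covers only the number prefixes it enumerates, so numbers with other prefixes are reported as invalid by this check.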

cleaning by using a knowledge base:

A knowledge base model is constructed, and the list data is compared with the non-standard data in the knowledge base; if the list data is equal to non-standard data in the knowledge base, the data is changed into the corresponding standard data by using an SQL UPDATE statement.

The knowledge base stores non-standard data and the corresponding standard data.

The model structure is as follows: T_CORE_FIELD (table of fields to be standardized), T_CORE_FIELD_STD (standard data table), T_CORE_FIELD_NO_STD (non-standard data comparison table)
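By way of non-limiting illustration, the equality comparison against the knowledge base may be sketched in JAVA as follows. The sketch replaces the database tables with an in-memory map from non-standard values to standard values; the class name `KnowledgeBaseSketch`, its methods and the sample value pairs are hypothetical and are not taken from the embodiment.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; an in-memory stand-in for the knowledge base tables. */
class KnowledgeBaseSketch {

    // Key: non-standard value; value: the corresponding standard value.
    private final Map<String, String> nonStandardToStandard = new HashMap<>();

    /** Records that a non-standard value corresponds to a standard value. */
    void register(String nonStandard, String standard) {
        nonStandardToStandard.put(nonStandard, standard);
    }

    /**
     * Returns the standard value if the input equals a known non-standard value,
     * otherwise returns the input unchanged.
     */
    String clean(String value) {
        return nonStandardToStandard.getOrDefault(value, value);
    }

    public static void main(String[] args) {
        KnowledgeBaseSketch kb = new KnowledgeBaseSketch();
        kb.register("BJ", "BEIJING");      // hypothetical sample pairs
        kb.register("PEKING", "BEIJING");
        System.out.println(kb.clean("PEKING"));
        System.out.println(kb.clean("SHANGHAI"));
    }
}
```

Because the comparison is an exact match, it benefits from running after preprocessing, when spaces and letter case have already been made uniform.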

Deep cleaning and manual confirmation:

firstly, a standard library (std_lib) is constructed, in which a set of standard data is stored.

Similar data in the standard library is found by using a cosine value algorithm based on the key fields in the list. It is then manually confirmed which standard data corresponds to the list data, and finally the data in the list is replaced with the standard data by using an SQL UPDATE statement.

Cosine value algorithm:

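$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \cdot \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are; when the included angle equals 0, the two vectors point in the same direction and are regarded as equal. This measure is called "cosine similarity".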

Example:

Name A: New Zealand red rose Queen apples, 12 pieces, 140 g or more per piece (新西兰红玫瑰Queen苹果12个140g以上/个)

Name B: New Zealand red rose Queen apples, 6 pieces, 150 g or more per piece (新西兰红玫瑰Queen苹果6个150g以上/个)

First step, split each name into individual characters (the names are split character by character in their original Chinese form)

Name A: 新 西 兰 红 玫 瑰 Q u e e n 苹 果 1 2 个 1 4 0 g 以 上 / 个

Name B: 新 西 兰 红 玫 瑰 Q u e e n 苹 果 6 个 1 5 0 g 以 上 / 个

Second step, list all distinct characters of the two names (de-duplicated)

Character set: 新 西 兰 红 玫 瑰 Q u e n 苹 果 1 2 个 4 0 g 以 上 / 6 5

Third step, count the character frequencies

Name A: 新[1] 西[1] 兰[1] 红[1] 玫[1] 瑰[1] Q[1] u[1] e[2] n[1] 苹[1] 果[1] 1[2] 2[1] 个[2] 4[1] 0[1] g[1] 以[1] 上[1] /[1] 6[0] 5[0]

Name B: 新[1] 西[1] 兰[1] 红[1] 玫[1] 瑰[1] Q[1] u[1] e[2] n[1] 苹[1] 果[1] 1[1] 2[0] 个[2] 4[0] 0[1] g[1] 以[1] 上[1] /[1] 6[1] 5[1]

Fourth step, write out the frequency vectors

The name A: (1,1,1,1,1,1,1,1,2,1,1,1,2,1,2,1,1,1,1,1,1,0,0)

The name B: (1,1,1,1,1,1,1,1,2,1,1,1,1,0,2,0,1,1,1,1,1,1,1)

Fifth step, apply the formula

Value = (A1 × B1 + A2 × B2 + A3 × B3 + ... + An × Bn) / (sqrt(sum of squares of A) × sqrt(sum of squares of B))

Value = 26 / (sqrt(30) × sqrt(27))

Value = 26 / 28.4605

Value = 0.9135

The larger the value, the more similar the two names are: a value of 1 means they are identical, and a value of 0 means they are completely different.
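By way of non-limiting illustration, the calculation of the fifth step may be sketched in JAVA as follows, using the two frequency vectors written out in the fourth step. The class name `CosineSimilaritySketch` and its method names are hypothetical. The helper `frequencyVectors`, which builds character frequency vectors from two strings, is one possible way to carry out the first to fourth steps and is an assumption of this sketch, as is returning 0 when either vector is all zeros.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; not the implementation of the embodiment. */
class CosineSimilaritySketch {

    /** cos(theta) = sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2))). */
    static double cosine(int[] x, int[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        long dot = 0, sumX = 0, sumY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += (long) x[i] * y[i];
            sumX += (long) x[i] * x[i];
            sumY += (long) y[i] * y[i];
        }
        if (sumX == 0 || sumY == 0) {
            return 0.0; // convention chosen in this sketch for an all-zero vector
        }
        return dot / (Math.sqrt(sumX) * Math.sqrt(sumY));
    }

    /** Builds character frequency vectors over the de-duplicated characters of both strings. */
    static int[][] frequencyVectors(String a, String b) {
        Map<Integer, int[]> counts = new LinkedHashMap<>();
        a.codePoints().forEach(cp -> counts.computeIfAbsent(cp, k -> new int[2])[0]++);
        b.codePoints().forEach(cp -> counts.computeIfAbsent(cp, k -> new int[2])[1]++);
        int[][] vectors = new int[2][counts.size()];
        int i = 0;
        for (int[] c : counts.values()) {
            vectors[0][i] = c[0];
            vectors[1][i] = c[1];
            i++;
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Frequency vectors of name A and name B from the fourth step.
        int[] a = {1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0};
        int[] b = {1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 2, 0, 1, 1, 1, 1, 1, 1, 1};
        System.out.println(String.format(Locale.ROOT, "%.4f", cosine(a, b))); // prints 0.9135
    }
}
```

For the two vectors above the dot product is 26 and the sums of squares are 30 and 27, so the sketch reproduces the value 0.9135 obtained in the fifth step.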

Standard library data similar to the list data is listed, and it is manually confirmed which entry matches the list data; after confirmation, the list data is directly replaced with the standard library data by using an SQL UPDATE statement.
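By way of non-limiting illustration, the listing of similar standard library data for manual confirmation may be sketched in JAVA as follows. The similarity measure is passed in as a parameter so that any measure, such as the cosine similarity described above, can be used. The class name `CandidateRankingSketch`, the `Candidate` type, the minimum score and the sample values are hypothetical, and the table name in the UPDATE text is an assumption; the embodiment does not specify a threshold or the statement text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only; not the implementation of the embodiment. */
class CandidateRankingSketch {

    /** A standard library entry together with its similarity to the list value. */
    static final class Candidate {
        final String standardValue;
        final double score;

        Candidate(String standardValue, double score) {
            this.standardValue = standardValue;
            this.score = score;
        }
    }

    // Parameterized statement for the replacement after manual confirmation.
    // The table name T_DATA_CLEAN is hypothetical; STR1 follows the field naming above.
    static final String UPDATE_SQL = "UPDATE T_DATA_CLEAN SET STR1 = ? WHERE STR1 = ?";

    /** Lists the entries scoring at least minScore, most similar first. */
    static List<Candidate> rank(String listValue, List<String> standardLibrary,
                                ToDoubleBiFunction<String, String> similarity,
                                double minScore) {
        List<Candidate> result = new ArrayList<>();
        for (String standardValue : standardLibrary) {
            double score = similarity.applyAsDouble(listValue, standardValue);
            if (score >= minScore) {
                result.add(new Candidate(standardValue, score));
            }
        }
        result.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Placeholder measure for the demonstration only: 1.0 for equal strings,
        // 0.6 for a shared first character, 0.1 otherwise.
        ToDoubleBiFunction<String, String> demo = (a, b) ->
                a.equals(b) ? 1.0 : (a.charAt(0) == b.charAt(0) ? 0.6 : 0.1);
        for (Candidate c : rank("AB", List.of("CD", "AX", "AB"), demo, 0.5)) {
            System.out.println(c.standardValue + " " + c.score);
        }
    }
}
```

Using placeholders in the UPDATE text rather than concatenating values keeps the confirmed standard value and the original list value out of the SQL string itself.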

The cleaned data list is exported by using JAVA POI.

Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims

1. A method for cleaning EXCEL data source based on big data is characterized by comprising the following steps:

analyzing and structuring an EXCEL data source;

cleaning the standardized EXCEL data source;

2. The EXCEL data source cleaning method based on big data according to claim 1, parsing and structuring EXCEL data source comprising:

establishing a corresponding relation between the field and the title;

and storing the data of the EXCEL data source into a database.

3. The EXCEL data source cleaning method based on big data according to claim 1, characterized in that the normalization process of key attribute names to the data in the resolved and structured EXCEL data source is to match the key field data in the EXCEL data source with the standard data.

4. The EXCEL data source cleaning method based on big data according to claim 1, characterized in that cleaning the normalized EXCEL data source comprises:

preprocessing data in the EXCEL data source;

5. The EXCEL data source cleaning method based on big data according to claim 4, characterized in that the preprocessing comprises:

and checking the mobile phone number by using a regular expression.

6. The EXCEL data source cleaning method based on big data according to claim 5, characterized in that the corresponding data in the standard library is found from the key fields in the EXCEL data source using cosine value algorithm, wherein the cosine value algorithm is:

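$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \cdot \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$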

7. The EXCEL data source cleaning method based on big data according to claim 6, characterized in that the standard library data corresponding to the data in EXCEL data source is listed, the standard data matching the data in EXCEL data source is confirmed, after confirmation, the data in EXCEL data source is directly replaced with the data of standard library using sql update method.

8. An EXCEL data source cleaning system based on big data, comprising:

9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the method of any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.