CN110019153B

CN110019153B - Multi-type batch data processing system and processing method thereof

Info

Publication number: CN110019153B
Application number: CN201710822561.3A
Authority: CN
Inventors: 聂妍
Original assignee: Beijing Chenxin Credit Information Co ltd
Current assignee: Beijing Chenxin Credit Information Co ltd
Priority date: 2017-09-13
Filing date: 2017-09-13
Publication date: 2022-03-04
Anticipated expiration: 2037-09-13
Also published as: CN110019153A

Abstract

The invention discloses a multi-type batch data processing system and a processing method. The system comprises an original data file part, a data file format conversion part, a data file classification conversion part, a data file structuring processing part, a data cleaning processing part and a data theme storage part. The method comprises the following steps: 1. converting the format of the data files in the original data file part by using the data file format conversion part; 2. classifying and storing the data files processed by the data file format conversion part by using the data file classification conversion part; 3. structuring the processable file types from step 2 by using the data file structuring processing part; 4. cleaning the structured data files by using the data cleaning processing part; 5. classifying and storing the data files according to their themes by using the data theme storage part. The system and the method of the invention are simple and can process multi-type batch data files in a short time.

Description

Multi-type batch data processing system and processing method thereof

Technical Field

The present invention relates to the field of data processing, and in particular, to processing of multi-type batch data, and more particularly, to a multi-type batch data processing system and a processing method thereof.

Background

The arrival of the internet information age and of big data technologies has created an unprecedented flood of data. Many stakeholders expect to obtain data through data sharing and exchange, but the exchange process is greatly hindered by differing data storage modes, differing data structures and numerous data sources.

Data must be cleaned before it is stored and warehoused. At present, however, data cleaning is mainly manual, which is time-consuming, labor-intensive, inefficient and highly inconsistent; manual operation is difficult to standardize and may cause secondary data pollution. In particular, if multi-type (complex) batch data is processed manually, time and labor are heavily consumed, the data cannot be applied rapidly and efficiently, the value density of the data in application is too low, and the progress of data exchange, sharing and rapid application is greatly hindered.

Disclosure of Invention

In order to overcome the above problems, the present inventors have made intensive studies to obtain a multi-type batch data processing system and a processing method thereof, thereby completing the present invention.

One aspect of the present invention provides a multi-type batch data processing system, which is embodied in the following aspects:

(1) a multi-type batch data processing system, comprising

an original data file part 001 for storing a plurality of types of batch data files to be processed;

a data file format conversion part 002 for performing format conversion and path conversion on the batch data files in the original data file part 001;

a data file classification conversion part 003 for classifying the batch data files converted by the data file format conversion part 002 into a set of processable-type files and a set of unprocessable-type files; and

a data file structuring processing part 004 for performing structuring processing on the obtained processable-type files to obtain data files capable of being stored in a structured manner.

(2) The data processing system according to the above (1), further comprising a data cleaning processing part 005 and a data topic storage part 006, wherein the data cleaning processing part 005 is configured to perform cleaning processing on the obtained data file which can be stored in a structured manner; the data topic storage unit 006 is used to classify and store the data after the cleaning process.

The invention provides a method for processing multi-type batch data, which is embodied in the following aspects:

(3) a method for processing multi-type batch data, preferably using the system of any one of claims 1 to 5, wherein the method comprises the steps of:

step 1, using data file format conversion part 002 to convert the format of the data file in original data file part 001;

step 2, the data file classification conversion part 003 is used for classifying and storing the data files processed by the data file format conversion part 002;

step 3, the data file structuring processing part 004 is used for structuring the processable file type in the step 2 to obtain a data file which can be stored in a structuring way;

step 4, the data cleaning processing part 005 is used for cleaning the obtained data file which can be stored in a structured way;

and step 5, classifying and storing according to the theme of the data file by using the data theme storage part 006.

Drawings

FIG. 1 shows a schematic of the framework of the system of the present invention;

FIG. 2 shows a flow chart of the method of the present invention;

FIG. 3 illustrates a schematic diagram of multi-type batch data processed by an embodiment;

FIG. 4 shows the processing procedure and the processing result of step 1 in example 1;

FIG. 5 shows the results of the treatment of step 2 in example 1;

FIG. 6 shows the processing results of step 3 and step 4 in example 1.

Detailed Description

The invention is explained in further detail below with reference to the drawings. The features and advantages of the present invention will become more apparent from the description.

An aspect of the present invention provides a multi-type batch data processing system, as shown in fig. 1, the system includes an original data file section 001, a data file format conversion section 002, a data file classification conversion section 003, a data file structuring processing section 004, a data cleansing processing section 005, and a data topic storage section 006.

The original data file part 001 is used for storing multi-type batch data files to be processed; the data file format conversion part 002 is used for converting the format and the path of the batch data files in the original data file part 001; the data file classification converting part 003 is configured to classify the batch data files converted by the data file format converting part 002 into processable data files and non-processable data files; the data file structuring processing unit 004 is configured to perform structuring processing on the obtained processable data file to obtain a data file capable of being stored in a structured manner; the data cleaning processing unit 005 is configured to perform cleaning processing on the obtained data file that can be stored in a structured manner; the data theme storage 006 is used to classify and store the data files after the cleaning process.

In the invention, the objects processed by the system are multi-type batch data files. That is, the data files comprise files of multiple types, for example EXCEL files, SQL script files, and text files including CSV and TXT files; and the volume of data is very large, at least at the T level (terabytes). The prior art does not address this situation: there, data processing typically handles data of a single type or data in small quantities.

According to a preferred embodiment of the present invention, as shown in fig. 1, the data file format converting part 002 includes a data file format converting module 021, a data file path converting module 022 and a data file deduplication converting module 023.

Wherein:

the data file format conversion module 021 is configured to perform format conversion on the data files in the original data file part 001, specifically: (1) decompress the original compressed data files, and mark any data file that cannot be decompressed as a non-compliant data file or delete it directly; (2) determine whether the data files in the same subdirectory are split files and, if so, merge the split text-format data files and restore them into the original data file; and (3) determine whether a data file has an erroneous file suffix, and mark any data file so determined as a non-compliant data file or delete it;

the data file path conversion module 022 is configured to perform path conversion on the data files after the data file format conversion module 021 has converted their format, specifically: (1) extract the data files in multiple subfolders into a main directory; (2) delete the redundant subfolders after extraction; and (3) mark any data file that cannot be moved or operated on as a non-compliant data file or delete it;

the data file deduplication conversion module 023 is configured to perform deduplication processing on the data files after the path conversion, that is, to determine whether the stored data files are duplicate data files and to process and filter the duplicates, specifically: (1) judging by data file name and size: data files or sets of data files whose names are the same and/or similar and whose stored sizes are the same and/or similar are determined to be duplicate data files, and are marked as non-compliant data files or deleted; (2) judging by data file content: if the first 10 rows of data of the data files are completely the same and the stored sizes of the data files or sets of data files are the same and/or similar, the data files are determined to be duplicate data files, and are marked as non-compliant data files or deleted.

According to a preferred embodiment of the present invention, the data file classification converting part 003 includes a data file classification module 033, a processable file module 031 and an unprocessable file module 032.

Wherein:

the data file classification module 033 is configured to classify the data files processed by the data file format conversion part 002 into processable data files and unprocessable data files, where the processable data files include EXCEL files, database export files, text files, SQL script files, and the like, and the unprocessable data files include Word files, PDF files, audio files, video files, and the like;

the processable file module 031 is configured to store the processable data files obtained by the data file classification module 033, i.e., the data files that can be processed in a structured manner;

the unprocessable file module 032 is configured to store the unprocessable data files obtained by the data file classification module 033, i.e., the data files that cannot be processed in a structured manner.

According to a preferred embodiment of the present invention, the data cleansing processing part 005 includes a data content rule module 051 and a data storage rule module 052.

Wherein:

the data content rule module 051 is used for checking whether the data content is compliant and for cleaning non-compliant data, namely marking it as non-compliant data or deleting it, wherein non-compliant data comprises: (1) characters other than Chinese characters, English letters, Arabic numerals and common punctuation marks, (2) garbled characters, (3) null values in fields of the data that must not be null, (4) rows of the data in which the contents of all columns are repeated, (5) rows of the data in which the contents of the key columns are repeated;

the data storage rules module 052 is used to check whether the content of single-line and/or multi-line data corresponds to the column in which the data is located, check whether blank line data exists in the data, and mark the data as non-compliant data or delete the data.

Checking whether the content of single-row data corresponds to its column includes checking for column misalignment (that is, values within one row sitting in the wrong column positions); checking whether the content of multi-row data corresponds to its columns includes checking whether one row has been split into several rows by a line feed character that should not be present, and whether columns are misaligned.
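The two storage-rule checks described above (a row broken into several lines by a stray line feed, and a row with misplaced columns) can be illustrated with a short Python sketch. It is a heuristic under stated assumptions, not the patented implementation: fields are assumed to be separated by a plain delimiter, the expected column count is assumed to be known, and the function name repair_rows is hypothetical. It only repairs breaks that leave the first fragment with too few fields.

```python
def repair_rows(lines, n_cols, sep=","):
    """Rejoin records split by stray line feeds; flag rows with a wrong field count.

    Returns (good, misaligned): good is a list of field lists with exactly
    n_cols fields, misaligned is a list of raw records that could not be fixed.
    """
    good, misaligned, buffer = [], [], ""
    for line in lines:
        candidate = buffer + line if buffer else line
        fields = candidate.split(sep)
        if len(fields) < n_cols:
            # Too few fields: assume a stray line feed and keep reading.
            buffer = candidate
        elif len(fields) == n_cols:
            good.append(fields)
            buffer = ""
        else:
            # Too many fields: columns are misaligned, mark as non-compliant.
            misaligned.append(candidate)
            buffer = ""
    if buffer:
        misaligned.append(buffer)  # trailing fragment that never completed
    return good, misaligned
```

For example, with three expected columns the fragments "1,Ann" and ",Beijing" are rejoined into one record, while a line with four fields is reported as misaligned rather than silently stored.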

According to a preferred embodiment of the present invention, the data topic storage 006 includes a data topic library 061 and a data classification recording component 062.

The data topic library 061 is configured to divide the data files cleaned by the data cleaning processing unit 005 into different topic libraries according to different topics, where the topics include personal information topics, enterprise information topics, attribute topics (including vehicles, airplanes, daily necessities, and the like); the data classification recording component 062 is used to record topic classifications generated by the data topic library 061.

In this way, the data is classified and stored according to different topics. If data of the "vehicle" category is needed at a later stage, for example, it can be retrieved directly by way of the data classification recording component 062.
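The relationship between the data topic library 061 and the data classification recording component 062 can be sketched as two in-memory mappings. This is an illustration only: the class name TopicStore, its methods and the file names are hypothetical, and a real deployment would presumably persist both mappings in a database.

```python
from collections import defaultdict


class TopicStore:
    """Toy stand-in for the data topic storage part 006."""

    def __init__(self):
        self.libraries = defaultdict(list)  # topic -> data files (role of 061)
        self.record = {}                    # data file -> topic (role of 062)

    def store(self, name, topic):
        """File a cleaned data file under a topic and record the classification."""
        self.libraries[topic].append(name)
        self.record[name] = topic

    def lookup(self, topic):
        """Return the data files stored under a topic, without creating it."""
        return list(self.libraries.get(topic, []))
```

Retrieving "vehicle" data later is then a single lookup by topic instead of a scan over all stored files.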

Another aspect of the present invention provides a method for processing multi-type batch data, as shown in fig. 2, the method includes the following steps:

Step 1, the data file format conversion part 002 is used to convert the format of the data files in the original data file part 001.

The data files in the original data file section 001 are multi-type batch data files, and include, for example, EXCEL files, SQL script files, text files containing CSV and TXT, and the like.

According to a preferred embodiment of the invention, step 1 comprises the following sub-steps:

step 1.1, carrying out format conversion on the data file in the original data file part 001 by using a data file format conversion module 021;

step 1.2, a file path conversion module 022 is utilized to convert the data file path of the data file after format conversion, and preferably, the data files in various subfolders are extracted to a main directory;

step 1.3, the data file deduplication conversion module 023 is used for performing deduplication processing on the data file after the path conversion, judging whether the stored data file is a duplicate data file, and processing and filtering the data file.

According to a preferred embodiment of the invention, in step 1.1, the format conversion is performed as follows: the original compressed data files are decompressed, and any data file that cannot be decompressed is marked as a non-compliant data file or deleted.

In a further preferred embodiment, in step 1.1, the format conversion is also performed as follows: it is determined whether the data files in the same subdirectory are split files, and the split text-format data files are merged and restored into the original data file.

In a further preferred embodiment, in step 1.1, the format conversion is also performed as follows: it is determined whether a data file has an erroneous file suffix, and any data file so determined is marked as a non-compliant data file or deleted.
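The three format-conversion operations of step 1.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it handles ZIP archives only, it treats "marking" as appending the path to a list, the signature table is a small illustrative sample, and the names MAGIC, has_wrong_suffix, decompress_all and merge_split_files are hypothetical.

```python
import zipfile
from pathlib import Path

# Assumed sample of leading-byte signatures for a few suffixes.
MAGIC = {".zip": b"PK", ".xlsx": b"PK", ".pdf": b"%PDF"}


def has_wrong_suffix(path: Path) -> bool:
    """Return True if the file's leading bytes contradict its suffix."""
    expected = MAGIC.get(path.suffix.lower())
    if expected is None:
        return False  # no signature known for this suffix, cannot judge
    with path.open("rb") as fh:
        return fh.read(len(expected)) != expected


def decompress_all(src_dir: Path, non_compliant: list) -> None:
    """Extract every .zip under src_dir into a sibling folder; mark failures."""
    for archive in sorted(src_dir.rglob("*.zip")):
        try:
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(archive.with_suffix(""))
        except (zipfile.BadZipFile, OSError):
            non_compliant.append(archive)  # cannot be decompressed


def merge_split_files(parts: list, target: Path) -> None:
    """Concatenate split parts (given in order) back into one file."""
    with target.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())
```

The sketch marks problem files instead of deleting them, which keeps the operation reversible; the text allows either choice.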

The data file is converted into a file which can be operated through format conversion.

According to a preferred embodiment of the invention, step 1.2 comprises the following sub-steps:

step 1.2.1, extracting data files in various subfolders to a main directory;

step 1.2.2, deleting redundant multiple subfolders after extraction;

step 1.2.3, marking the data file which cannot be moved or operated on as a non-compliant data file or deleting the data file.

The data to be processed contains both files and folders, and folders may themselves contain further folders, so the data files may not all sit at the same directory level. Step 1.2 brings all data files to a single directory level for subsequent processing.
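Sub-steps 1.2.1 to 1.2.3 can be sketched with the Python standard library as follows. This is a sketch under assumptions: the function name flatten is hypothetical, "marking" is modeled as appending to a list, and the renaming of files whose names collide in the main directory is an added assumption that the text does not specify.

```python
import shutil
from pathlib import Path


def flatten(main_dir: Path, non_compliant: list) -> None:
    """Move every nested file into main_dir, then remove emptied subfolders."""
    nested = [p for p in main_dir.rglob("*")
              if p.is_file() and p.parent != main_dir]
    for src in sorted(nested):
        dest = main_dir / src.name
        n = 1
        while dest.exists():  # assumption: keep both files on a name clash
            dest = main_dir / f"{src.stem}_{n}{src.suffix}"
            n += 1
        try:
            shutil.move(str(src), str(dest))
        except OSError:
            non_compliant.append(src)  # cannot be moved or operated on
    # Delete redundant subfolders, deepest first; rmdir only removes empty ones.
    subdirs = [p for p in main_dir.rglob("*") if p.is_dir()]
    for d in sorted(subdirs, key=lambda p: len(p.parts), reverse=True):
        try:
            d.rmdir()
        except OSError:
            pass  # still holds a file that could not be moved
```

Removing folders deepest first means a parent folder is only attempted after its children are gone, so no recursive delete is needed and no unmoved file is lost.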

According to a preferred embodiment of the invention, in step 1.3, the deduplication process is performed according to the data file name and size, or according to the data file content.

In a further preferred embodiment, judging by data file name and size, data files or sets of data files whose names are the same and/or similar and whose stored sizes are the same and/or similar are determined to be duplicate data files, and are marked as non-compliant data files or deleted.

In a further preferred embodiment, judging by data file content, if the first 10 rows of data of the data files are completely the same and the stored sizes of the data files or sets of data files are the same and/or similar, the data files are determined to be duplicate data files, and are marked as non-compliant data files or deleted.

The purpose of step 1.3 is to remove duplicate data files.
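Both deduplication criteria of step 1.3 can be sketched in a few lines of Python. The sketch simplifies the text in one labeled way: "the same and/or similar" is reduced to exact equality (names compared case-insensitively, sizes compared byte for byte), because the text does not define a similarity measure. The names head_rows and find_duplicates are hypothetical.

```python
from itertools import islice
from pathlib import Path


def head_rows(path: Path, n: int = 10) -> tuple:
    """Return the first n lines of a file as raw bytes."""
    with path.open("rb") as fh:
        return tuple(islice(fh, n))


def find_duplicates(files: list, by_content: bool = False) -> list:
    """Return the files that repeat an earlier file under the chosen criterion."""
    seen, duplicates = set(), []
    for path in sorted(files):
        size = path.stat().st_size
        if by_content:
            key = (head_rows(path), size)    # criterion 2: first 10 rows + size
        else:
            key = (path.name.lower(), size)  # criterion 1: name + size
        if key in seen:
            duplicates.append(path)  # caller marks as non-compliant or deletes
        else:
            seen.add(key)
    return duplicates
```

Reading only the first 10 rows keeps the content check cheap even for very large files, at the cost of treating files that differ only later as duplicates when their sizes also match.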

Step 2, the data file classification conversion part 003 is used to classify and store the data files processed by the data file format conversion part 002.

According to a preferred embodiment of the invention, step 2 comprises the following sub-steps:

step 2.1, classifying the data files processed by the data file format conversion part 002 by using a data file classification module 033 into processable data files and non-processable data files;

step 2.2, storing the obtained processable data file by adopting a processable file type module 031;

step 2.3, storing the obtained unprocessable data file by adopting an unprocessable file type module 032.

The processable data files, namely the data files which can be processed in a structured manner, comprise EXCEL files, database export files, text files, SQL script files and the like, and the unprocessable data files comprise Word files, PDF files, audio files, video files and the like.
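The classification of step 2 can be illustrated by a suffix-based sketch. The suffix set below is an assumption derived from the file kinds named in the text (suffixes for database export files vary by product and are left out), unknown suffixes are treated as unprocessable, and the names PROCESSABLE and classify are hypothetical.

```python
from pathlib import Path

# Assumed suffixes for EXCEL, text and SQL script files.
PROCESSABLE = {".xls", ".xlsx", ".csv", ".txt", ".sql"}


def classify(files):
    """Split file names into (processable, unprocessable) by suffix."""
    processable, unprocessable = [], []
    for name in files:
        if Path(name).suffix.lower() in PROCESSABLE:
            processable.append(name)    # stored by module 031
        else:
            unprocessable.append(name)  # stored by module 032
    return processable, unprocessable
```

Because step 1.1 has already flagged files with erroneous suffixes, relying on the suffix here is a reasonable shortcut in this sketch.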

Step 3, the data file structuring processing part 004 is used to carry out structuring processing on the processable file types of step 2 to obtain data files which can be stored in a structured manner.

The data file is converted into data which is arranged and stored in a row and column mode through structuring processing, namely formatting processing. Specifically, the structuring process, i.e. the formatting process, is to convert the data file into data expressed and implemented logically by a two-dimensional table structure, strictly following the data format and length specification, and mainly storing and managing through a relational database.
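As an illustration of structuring processing, the following sketch loads CSV text into a two-dimensional table in a relational database. It uses Python's built-in sqlite3 module as a stand-in for whichever relational database is actually used, stores every column as text, and returns rows whose field count does not match the header so that they can be handled during cleaning. The function name load_csv and this handling of mismatched rows are assumptions.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, text):
    """Create a table from CSV text; return rows that do not fit the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Column and table names are taken on trust here; a real system
    # would validate them before building SQL.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    good = [r for r in body if len(r) == len(header)]
    bad = [r for r in body if len(r) != len(header)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', good)
    return bad
```

A call such as load_csv(sqlite3.connect(":memory:"), "people", csv_text) yields a queryable table plus the list of rows left over for the cleaning step.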

Step 4, the data cleaning processing part 005 is used to perform cleaning processing on the obtained data files which can be stored in a structured manner.

According to a preferred embodiment of the invention, step 4 comprises the following sub-steps:

step 4.1, using the data content rule module 051 to check whether the data content is in compliance, and performing data cleaning on the non-compliance data, namely marking the data as the non-compliance data or deleting the data;

step 4.2, using the data storage rule module 052 to check whether the content of the single-row and/or multi-row data corresponds to the column in which the single-row and/or multi-row data is located;

step 4.3, checking whether blank row data exist in the data, and marking the data as non-compliant data or deleting the data.

Wherein, in step 4.1, the non-compliant data comprises: (1) characters other than Chinese characters, English letters, Arabic numerals and common punctuation marks, (2) garbled characters, (3) null values in fields of the data that must not be null, (4) rows of the data in which the contents of all columns are repeated, (5) rows of the data in which the contents of the key columns are repeated; in step 4.2, checking whether the content of single-row data corresponds to its column includes checking for column misalignment (that is, values within one row sitting in the wrong column positions), and checking whether the content of multi-row data corresponds to its columns includes checking whether one row has been split into several rows by a line feed character that should not be present and whether columns are misaligned; in step 4.3, blank row data means that an entire row or an entire column is blank and has no actual content.
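The content rules of step 4.1 and the blank-row check of step 4.3 can be sketched as a single pass over the rows. The sketch makes several labeled assumptions: rows are dictionaries with identical columns, the allowed character set is one possible reading of "Chinese characters, English letters, Arabic numerals and common punctuation", garbled characters are caught only insofar as they fall outside that set, and the names ALLOWED and clean are hypothetical. Blank columns are not handled here.

```python
import re

# Assumed whitelist: CJK ideographs, ASCII letters and digits, whitespace,
# and a sample of common ASCII and full-width punctuation.
ALLOWED = re.compile(r"[\u4e00-\u9fffA-Za-z0-9\s.,;:!?()\-_/@，。；：！？（）、]*")


def clean(rows, required, key):
    """Return (kept, rejected); rejected holds (row, reason) pairs."""
    kept, rejected, seen_rows, seen_keys = [], [], set(), set()
    for row in rows:
        values = tuple(row.values())
        if all(v is None or not str(v).strip() for v in values):
            reason = "blank row"
        elif any(not ALLOWED.fullmatch(str(v)) for v in values if v is not None):
            reason = "illegal or garbled character"
        elif any(row.get(c) in (None, "") for c in required):
            reason = "null in non-null field"
        elif values in seen_rows:
            reason = "duplicate row"
        elif row.get(key) in seen_keys:
            reason = "duplicate key"
        else:
            reason = None
        if reason:
            rejected.append((row, reason))  # mark as non-compliant
        else:
            kept.append(row)
            seen_rows.add(values)
            seen_keys.add(row.get(key))
    return kept, rejected
```

Returning the rejected rows with a reason corresponds to the "mark as non-compliant" option; discarding that list corresponds to the "delete" option.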

According to a preferred embodiment of the invention, step 5 comprises the following sub-steps:

step 5.1, dividing the data files cleaned by the data cleaning processing part 005 into different theme libraries according to different themes by using a data theme library 061;

step 5.2, recording the topic classification generated by the data topic library 061 by using the data classification recording component 062.

In step 5, the processed data files may be classified and stored according to different topics, where the topics include personal information topics, enterprise information topics, attribute topics (including vehicles, airplanes, daily necessities, and the like), and the specific classification manner is determined according to business or research needs.

The prior art contains no relevant reports on processing multi-type batch data. In practical applications such data is processed manually, which not only takes a long time, generally more than 6 months, but also incurs high labor cost.

By contrast, the system or the method of the invention processes multi-type batch data automatically and generally needs about one month, or even as little as two weeks. The key point is that no manual operation is needed, which both greatly shortens the processing time and saves labor cost.

The invention has the advantages that:

(1) the system or the method of the invention is simple, can realize the processing of various types and batch data files, and obtains high-availability, high-purity and standard structured data;

(2) the system is a modular component for data processing, has strong processing availability and portability, and can be conveniently and quickly applied to other data systems to provide high-quality data services;

(3) the system or the method greatly shortens the processing time of the multi-type and batch data files;

(4) the system or the method saves labor cost.

Examples

The multi-type batch data shown in fig. 3, which has a size of 100T, is processed. As shown in fig. 3(a), the data includes multiple types of items such as folders and compressed packages; each folder has multiple subfolders, and each subfolder contains files and/or folders (because of the large amount of data, fig. 3(a) shows only a part of the files). For example, as shown in fig. 3(b), the folder "CoCo" contains 127 subfolders and, as shown in fig. 3(c), the first of these subfolders, "121376", itself contains data and folders. The data processed by the system or method described herein is therefore multi-type batch data files.

Step 1, decompressing the compressed packages in fig. 3(a) by using the data file format conversion module 021, extracting the data files in the multiple subfolders into a main directory by using the data file path conversion module 022, and finally performing deduplication processing on the data files after path conversion by using the data file deduplication conversion module 023, wherein the result is shown in fig. 4;

step 2, classifying the data files processed by the data file format conversion part 002 by using a data file classification module 033, dividing the data files into processable data files and non-processable data files, and storing the data files, wherein the result is shown in fig. 5;

step 3, the data file structuring processing part 004 is used to structure the processable file types of step 2 to obtain data files which can be stored in a structured manner, and the result is shown as step 3 in fig. 6; at this point, however, the data shows significant column misalignment;

step 4, using the data content rule module 051 to check whether the data content is compliant, and cleaning the non-compliant data, namely marking it as non-compliant data or deleting it, wherein the non-compliant data comprises: (1) characters other than Chinese characters, English letters, Arabic numerals and common punctuation marks, (2) garbled characters, (3) null values in fields of the data that must not be null, (4) rows of the data in which the contents of all columns are repeated, (5) rows of the data in which the contents of the key columns are repeated;

and the data storage rule module 052 is used for checking whether the content of the single-row and/or multi-row data corresponds to the column where the single-row and/or multi-row data is located, and correcting the condition of column dislocation;

then checking whether blank line data exist in the data, and marking the data as non-compliant data or deleting the data, wherein the result is shown as step 4 in FIG. 6;

step 5, dividing the data files cleaned by the data cleaning processing part 005 into different theme libraries according to different people, places, events, objects and organization units by using the data theme library 061, and recording the theme classification generated by the data theme library 061 by using the data classification recording component 062, so that the data can be used directly later.

The whole process took 12 days; when different data files are processed, the specific number of days varies with the condition of the data files.

In contrast, when the multi-type batch data of the above embodiment was processed and checked manually, the whole process took 180 days.

The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and merely illustrative. On the basis of the above, the invention can be subjected to various substitutions and modifications, and the substitutions and the modifications are all within the protection scope of the invention.

Claims

1. A method for processing multi-type batch data by using a multi-type batch data processing system is characterized in that,

the data processing system includes:

an original data file part (001) for storing multi-type batch data files to be processed;

a data file format conversion part (002) for performing format conversion and path conversion on the batch data files in the original data file part (001);

a data file classification conversion unit (003) for classifying the batch data files converted by the data file format conversion unit (002) into processable data files and unprocessable data files; and

a data file structuring processing unit (004) for performing structuring processing on the obtained processable data file to obtain a data file capable of being stored in a structured manner;

the data file format conversion unit (002) includes:

the data file format conversion module (021) is used for carrying out format conversion on the data file in the original data file part (001);

a data file path conversion module (022) for performing path conversion on the data file after the format conversion module (021) converts the format; and

a data file duplicate removal conversion module (023) for performing duplicate removal processing on the data file after the path conversion, judging whether the stored data file is a repeated data file, and processing and filtering the repeated data file;

the data file classification conversion unit (003) includes:

a data file classification module (033) for classifying the data files processed by the data file format conversion part (002) into processable data files and non-processable data files;

a processable file module (031) for storing the processable data files obtained by the data file classification module (033), namely the data files which can be processed in a structured manner; and

an unprocessable file module (032) for storing the unprocessable data files obtained by the data file classification module (033), namely the data files which cannot be processed in a structured manner;

the system further comprises a data cleaning processing part (005) and a data subject storage part (006), wherein the data cleaning processing part (005) is used for cleaning the obtained structuralized data file; the data theme storage part (006) is used for classifying and storing the cleaned data files;

the data cleaning processing part (005) comprises a data content rule module (051) and a data storage rule module (052), wherein: the data content rule module (051) is used for checking whether the data content is in compliance or not and cleaning the data which is not in compliance, namely marking the data which is not in compliance or deleting the data; the data storage rule module (052) is used for checking whether the content of single-row and/or multi-row data corresponds to the column, checking whether blank row data exists in the data, and marking the data as non-compliant data or deleting the data;

the data topic storage (006) includes a data topic library (061) and a data classification record component (062), wherein: the data subject library (061) is used for dividing the data files cleaned by the data cleaning processing part (005) into different subject libraries according to different subjects; the data classification recording component (062) is used for recording the topic classification generated by the data topic library (061);

the method comprises the following steps:

step 1, a data file format conversion part (002) is used for carrying out format conversion on a data file in an original data file part (001);

step 2, classifying and processing the data files processed by the data file format conversion part (002) by using a data file classification conversion part (003) and storing the data files;

step 3, a data file structuring processing part (004) is used for structuring the processable file type in the step 2 to obtain a data file which can be stored in a structuring way;

step 4, cleaning the obtained data file which can be stored in a structured way by a data cleaning processing part (005);

step 5, classifying and storing according to the theme of the data file by using a data theme storage part (006);

the data files in the original data file part (001) are multi-type batch data files, and comprise EXCEL files, SQL script files and text files containing CSV and TXT;

step 1 comprises the following substeps: step 1.1, carrying out format conversion on the data file in the original data file part (001) by using a data file format conversion module (021);

step 1.2, a file path conversion module (022) is utilized to convert the data file after format conversion into a data file path, and preferably, the data files in various subfolders are extracted into a main directory;

step 1.3, carrying out duplicate removal processing on the data file after the path conversion by using a data file duplicate removal conversion module (023), judging whether the stored data file is a repeated data file, and processing and filtering the data file;

in step 1.1, the format conversion proceeds as follows: decompressing the original compressed data file, and marking the data file which can not be decompressed as an unqualified data file or deleting the data file; judging whether the data files in the unified subfile directory are split files or not, and merging and restoring the split text format data files into original data files; judging whether an error file suffix exists or not, and marking the data file which is judged as the error file suffix as an unqualified data file or deleting the data file;

in step 1.2, step 1.2 comprises the following sub-steps:

step 1.2.1, extracting data files in various subfolders to a main directory;

step 1.2.2, deleting redundant multiple subfolders after extraction;

step 1.2.3, marking the data file which can not be moved or operated as an unqualified data file or deleting the data file;

in step 1.3, the duplicate removal processing is carried out according to the name and size of the data file or according to the content of the data file; judging by data file name and size, data files or sets of data files whose names are the same and/or similar and whose stored sizes are the same and/or similar are determined to be repeated data files, and are marked as non-compliant data files or deleted; judging by data file content, if the first 10 rows of data of the data files are completely the same and the stored sizes of the data files or sets of data files are the same and/or similar, the data files are determined to be repeated data files, and are marked as non-compliant data files or deleted;

step 2 comprises the following substeps:

step 2.1, classifying the data files processed by the data file format conversion part (002) by using a data file classification module (033) into processable data files and non-processable data files;

step 2.2, storing the obtained processable data file by adopting a processable file type module (031);

step 2.3, storing the obtained unprocessable data file by adopting an unprocessable file type module (032);

the processable data files are data files which can be processed in a structured manner and comprise EXCEL files, database export files, text files and SQL script files, and the unprocessable data files comprise Word files, PDF files, audio files and video files;

step 4 comprises the following substeps:

step 4.1, using a data content rule module (051) to check whether the data content is in compliance, and carrying out data cleaning on the non-compliance data, namely marking the non-compliance data as the non-compliance data or deleting the non-compliance data;

step 4.2, checking whether the content of the single-row and/or multi-row data corresponds to the column of the single-row and/or multi-row data by using a data storage rule module (052);

step 4.3, checking whether blank line data exist in the data, and marking the data as non-compliant data or deleting the data;

step 5 comprises the following substeps:

step 5.1, dividing the data files cleaned by the data cleaning processing part (005) into different theme libraries according to different themes by using a data theme library (061);

step 5.2, recording the theme classification generated by the data theme library (061) by using a data classification recording component (062);

wherein, in step 5.1, the topics comprise personal information topics, business information topics and attribute topics.

2. The method of claim 1,

in step 4.1, the non-compliant data includes: (1) characters other than Chinese characters, English letters, Arabic numerals and common punctuation marks, (2) garbled characters, (3) null values in fields of the data that must not be null, (4) rows of the data in which the contents of all columns are repeated, (5) rows of the data in which the contents of the key columns are repeated;

in step 4.3, blank row data means that an entire row or an entire column is blank and has no actual content.