CN110990384A - Big data platform BI analysis method - Google Patents

Big data platform BI analysis method

Info

Publication number
CN110990384A
CN110990384A (application CN201911066534.3A)
Authority
CN
China
Prior art keywords
data
classification
information
value
platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911066534.3A
Other languages
Chinese (zh)
Other versions
CN110990384B (en)
Inventor
闻小明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Sinocare Technology Co ltd
Original Assignee
Wuhan Sinocare Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Sinocare Technology Co ltd filed Critical Wuhan Sinocare Technology Co ltd
Priority to CN201911066534.3A priority Critical patent/CN110990384B/en
Publication of CN110990384A publication Critical patent/CN110990384A/en
Application granted granted Critical
Publication of CN110990384B publication Critical patent/CN110990384B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2462 Approximate or statistical queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a BI analysis method for a big data platform. By adding error-classification identification to data cleaning, minority classification samples can be removed or merged, and the overall trend of the data can be grasped from the perspective of data integrity; the minority classification information is treated as a special classification or small data type, several minority classifications can be merged as a whole into an "other" category, and the "other" category of each data set is summarized in the constructed data model, so that more accurate results can be provided for macroscopic or overall data analysis. By adding center and spread metrics to data cleaning, when a user selects a certain type of data as the reference core sample data, a data range can be set around the mean or the median; combining this range with time and other associated data information, a preliminary data mining algorithm can be produced, with which the user can estimate or project future data.

Description

Big data platform BI analysis method
Technical Field
The invention relates to the field of big data, in particular to a BI analysis method for a big data platform.
Background
BI analysis is the process of loading business system data into a data warehouse after extraction, cleaning and transformation. As information technology develops, data quality has remained a central concern throughout the data mining process. Data cleaning is the most resource-consuming step in data mining, and how to effectively clean data and convert it into a data source that meets the mining requirements is a key factor affecting mining accuracy. Existing data cleaning methods include missing value processing, abnormal value processing, de-duplication and noise data processing; these methods can screen out duplicated and redundant data, complete missing data, correct or delete erroneous data, and finally organize the data into a form that can be further processed and used. However, where data quality requirements are high, traditional data cleaning methods cannot meet them, so the decision support given by the BI analysis system is not accurate. The invention therefore provides a BI analysis method for a big data platform that improves the quality of data cleaning and thus provides more accurate analysis results.
Disclosure of Invention
In view of this, the invention provides a big data platform BI analysis method, which can improve the data quality of data cleaning and further provide a more accurate analysis result.
The technical scheme of the invention is realized as follows: the invention provides a BI analysis method for a big data platform, which comprises the following steps:
s1, constructing a data extraction component, and extracting data in the database or the text document by using the data extraction component;
s2, analyzing the target and the requirement according to the collected data, and creating a data model table structure;
s3, carrying out data cleaning and data transformation processing on the collected data, wherein the data cleaning comprises missing data processing, error classification identification, center and scatter measurement and outlier identification;
s4, describing potential patterns and trends in the data with description algorithms, classifying the data with classification algorithms, grouping the data into several categories according to their similarities and differences with clustering algorithms, deriving associations or correlations among data items with association rules, and performing estimation and prediction with statistical algorithms for estimation and prediction.
On the basis of the above technical solution, preferably, the data extraction component in S1 includes a general relational database extraction component and a general text extraction component;
the construction method of the general relational database extraction component comprises the following steps:
s101, constructing a database connection configuration object;
s102, constructing a database connection operation interface;
s103, constructing a database basic information query interface;
s104, querying data information by using a JPA technology;
the construction method of the universal text extraction component comprises the following steps:
s201, constructing a text connection configuration object;
s202, constructing a text connection operation interface;
s203, constructing a text information reading interface.
On the basis of the above technical solution, preferably, the missing data processing in S3 specifically includes the following steps:
s301, when a recorded value is missing, replacing the missing value by inserting the mean value or by inserting classification information in proportion;
s302, when the current record is associated with data in a plurality of tables, counting the associated table data to obtain the associated data types, the data volume and the percentage of each classification, calculating the mean range of the missing data, and inserting the missing value according to the mean range, the percentages and the time variable of the missing data.
Further preferably, identifying the misclassification comprises the following steps:
s401, identifying classification information through the classificationUtil classification identification method and the classification table or classification dictionary data identification method;
s402, counting, for the data samples in each data table, how often each classification is referenced, namely obtaining the reference information of each classification within the overall data;
and S403, merging a classification into the minority classification information when its references account for less than 1% of the total reference data of all classifications.
Further preferably, the classificationUtil classification and identification method includes the following steps:
s501, acquiring all data information of a sample table;
s502, counting the data in each row with a hashMap of key-value pairs, incrementing the count in the hashMap set each time a value recurs, so that the repetition count of each row value is obtained after all data have been processed;
and S503, when the repetition count is below 30, extracting the classification data contained in the high-frequency short data samples and normalizing it into classification information.
Further preferably, the classification table or classification dictionary data identification method includes the steps of:
s601, acquiring all data of a classification table or a dictionary, and identifying classification names, classification numbers, classification codes and paths in the data;
s602, obtaining the repeated data references according to the classificationUtil classification identification method, and resolving the classification numbers or codes and their higher-level classification numbers or codes from the data counts;
s603, for a single data dictionary or classification table in which a single upper-level code corresponds to a plurality of lower-level codes and the coded data is frequently referenced in other data tables, resolving the coded data into single-type classification information or a classification table.
Further preferably, the center and spread metrics comprise the steps of:
s701, acquiring all data information of the table, and calculating the data value range of the field type from the numeric data it contains;
s702, within the data value range, counting the number of occurrences of each distinct value with a hashMap object, and computing the repetition count of each data segment from the total number of values acquired;
s703, obtaining the main central point and measurement information, namely the mean and the median, from the repetition counts of the segmented data;
and S704, taking the mean and the median as references, counting the data samples of the same type, and inferring the user's points of interest from the time, mean and data range.
Further preferably, the data transformation in S3 includes min-max normalization and Z-score normalization;
the min-max normalization works as follows: the difference between the field value and the minimum value is taken and scaled by the range, i.e. the maximum minus the minimum;
the Z-score normalization works as follows: the difference between the field value and the field mean is taken and scaled by the standard deviation SD of the field values.
On the basis of the above technical solution, preferably, the statistical algorithms for estimation and prediction in S4 include a point estimation method, an interval estimation method and a hypothesis testing method.
Compared with the prior art, the big data platform BI analysis method has the following beneficial effects:
(1) by adding error-classification identification to data cleaning, minority classification samples can be removed or merged, and the overall trend of the data can be grasped from the perspective of data integrity; the minority classification information is treated as a special classification or small data type, several minority classifications can be merged as a whole into an "other" category, and the "other" category of each data set is summarized in the constructed data model, so that more accurate results can be provided for macroscopic or overall data analysis;
(2) by adding center and spread metrics to data cleaning, when a user selects a certain type of data as the reference core sample data, a data range can be set around the mean or the median; combining this range with time and other associated data information, a preliminary data mining algorithm can be produced, with which the user can estimate or project future data;
(3) because different government affairs platforms handle only their own main business, each platform accumulates different business data according to its business characteristics, resulting in a large number of Excel documents and business data. Through the data extraction component, data configuration information can be constructed and connected to a database or text file, so that data in databases and in Excel documents can be extracted, the business data of each platform can be integrated, and data silos can be broken down.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a big data platform BI analysis method of the present invention;
FIG. 2 is a flow chart of a method for constructing a generic relational database extraction component in a big data platform BI analysis method of the present invention;
FIG. 3 is a flow chart of a method for constructing a universal text extraction component in a big data platform BI analysis method according to the present invention;
FIG. 4 is a flow chart of missing data processing in a big data platform BI analysis method of the present invention;
FIG. 5 is a flow chart of the identification of error classifications in a big data platform BI analysis method of the present invention;
FIG. 6 is a flow chart of the classificationUtil classification identification method of FIG. 5;
FIG. 7 is a flow chart of a method of identifying classification table or classification dictionary data in FIG. 5;
FIG. 8 is a flow chart of the hub and scatter metrics in a big data platform BI analysis method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in FIG. 1, the big data platform BI analysis method of the present invention comprises the following steps:
s1, constructing a data extraction component, and extracting data in the database or the text document by using the data extraction component;
s2, analyzing the target and the requirement according to the collected data, and creating a data model table structure;
s3, carrying out data cleaning and data transformation processing on the collected data, wherein the data cleaning comprises missing data processing, error classification identification, center and scatter measurement and outlier identification;
s4, describing potential patterns and trends in the data with description algorithms, classifying the data with classification algorithms, grouping the data into several categories according to their similarities and differences with clustering algorithms, deriving associations or correlations among data items with association rules, and performing estimation and prediction with statistical algorithms for estimation and prediction.
The beneficial effects of this embodiment are as follows: identifying error classifications effectively improves the accuracy of data mining; minority classification samples are removed or merged, and the overall trend of the data is grasped from the perspective of data integrity; in addition, this helps the user eliminate the influence of a small number of data samples on the overall data trend when the data model is created;
the mean, median and mode can be obtained through the center and spread metrics; the mean is an important part of data mining, and corresponding data mining algorithms can be drawn up according to the user's points of interest and the data mining direction; different algorithms need to refer to different means within the data set in order to estimate or analyze the desired data, and the same applies to the median and the mode;
potential patterns and trends in the data can be described by the description algorithm; the classification algorithm divides the data in the database into different classes according to common characteristics; the clustering algorithm divides a group of data into several classes according to their similarities and differences, so that the similarity between data in the same class is large, the similarity between data in different classes is small, and cross-class data relevance is low; an association rule allows the occurrence of other data items to be deduced from the occurrence of one data item; and the statistical algorithm for estimation and prediction discovers the relationships and rules present in the data and predicts future trends from the existing data.
Example 2
The embodiment provides a data extraction mode on the basis of embodiment 1. The data extraction component is used to extract data and to interface the data with the system. Data interfacing requires the ability to read Excel table data and to extract data from traditional relational databases such as MySQL, Oracle and SQLServer. With this capability, the data information on which the analysis method is based can be extracted, providing a data basis for the next step.
The data extraction component comprises a general relational database extraction component and a general text extraction component.
As shown in fig. 2, the method for constructing the general relational database extraction component includes the following steps:
s101, constructing a database connection configuration object;
s102, constructing a database connection operation interface;
s103, constructing a database basic information query interface;
s104, querying data information by using a JPA technology;
as shown in fig. 3, the method for constructing the universal text extraction component includes the following steps:
s201, constructing a text connection configuration object;
s202, constructing a text connection operation interface;
s203, constructing a text information reading interface.
Taking a village service cloud platform as an example, the platform's data information is read: a relational database docking component is built with JPA (Java Persistence API) technology, and after the connection information for the different databases is configured, the business data in the databases can be read directly. Excel files are read with Apache's POI component, extended for this file-reading purpose.
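The following is a minimal sketch, not the patented component itself, of how an Excel document might be read with Apache POI in the spirit of the universal text extraction component; the class name and the choice of reading only the first sheet are illustrative assumptions.

    import org.apache.poi.ss.usermodel.*;
    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: read every cell of the first sheet of an Excel workbook
    // with Apache POI and return the rows as text values.
    public class ExcelExtractorSketch {
        public static List<List<String>> readFirstSheet(File excelFile) throws Exception {
            List<List<String>> rows = new ArrayList<>();
            try (Workbook workbook = WorkbookFactory.create(excelFile)) {
                Sheet sheet = workbook.getSheetAt(0);           // first sheet only, for brevity
                DataFormatter formatter = new DataFormatter();  // renders any cell type as text
                for (Row row : sheet) {
                    List<String> values = new ArrayList<>();
                    for (Cell cell : row) {
                        values.add(formatter.formatCellValue(cell));
                    }
                    rows.add(values);
                }
            }
            return rows;
        }
    }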
The beneficial effect of this embodiment is as follows: because different government affairs platforms handle only their own main business, each platform accumulates different business data according to its business characteristics, resulting in a large number of Excel documents and business data. In this embodiment, data configuration information can be constructed through the data extraction component and connected to a database or text file, so that data in databases and in Excel documents can be extracted, the business data of each platform can be integrated, and data silos can be broken down.
Example 3
On the basis of embodiment 1 or embodiment 2, this embodiment provides a data modeling method. Data are acquired with the general data extraction component, and a general storage scheme is provided for the acquired data information, i.e., data basis modeling is carried out. The collected data information is stored using a standard data model.
Data modeling is a process of an information system for defining and analyzing the requirements of data and the corresponding support it needs. Data modeling defines not only data elements, but also their structure and relationships between them.
The data modeling is realized by the following steps:
s801, analyzing a target according to user data, and combing analysis requirements;
s802, establishing a data model table structure by using a data modeling component;
s803, collecting data information by using a data extraction component;
s804, converting the acquired data information by using a data processing assembly;
and S805, saving the converted data information by using the data storage component.
The beneficial effects of this embodiment are as follows: the method can acquire the table structure information of the original data source according to different data sources and then construct the data model table structure; after construction is complete, system developers can fill the data model table structure with data, completing data integration and providing a data basis for the next steps of data processing and mining.
Example 4
On the basis of embodiment 3, the present embodiment provides an embodiment of data modeling, and the present embodiment takes a rural data platform as an example. According to the modeling method of embodiment 3, the modeling method is described in detail in this embodiment.
S801, analyzing a target according to user data, and combing analysis requirements;
specifically, data are integrated and stored, and a special subject database is built according to requirements.
S802, establishing a data model table structure by using a data modeling component; the method specifically comprises the following four steps:
the first step is to fulfill the basic data integration requirement. The requirements need to read data source original data information through a data acquisition assembly to acquire original data source table structure information.
And secondly, completing population data integration. The demand mainly integrates population data and associated data information, basic data is imported in a similar basic data integration mode, and an imported data copy model is created. The important point of attention here is that the association set model is constructed from the key information of the population data, such as the key information of name, identification number, main key of population information, etc. The association set model mainly comprises: the method comprises the steps of a main key, creation time, updating time, name, identification card number, data source main key and association table set; according to the model, the population associated data information can be effectively integrated, and support is provided for subsequent data analysis.
And thirdly, completing thematic data integration. The method mainly integrates thematic data and associated data information, basic data are imported in a similar basic data integration mode, and an imported data copy model is created. The key point of attention here is the key word information of the thematic data, such as key information of thematic names, thematic numbers, thematic data main keys, thematic time and the like, and the communication association set model.
The model mainly comprises: key, creation time, update time, topic name, topic number, topic key, topic time, and association table set. And providing associated data information for the event topic information according to the model, and simultaneously supporting subsequent data analysis services.
S803, collecting data information by using a data extraction component;
s804, converting the acquired data information by using a data processing assembly;
and S805, saving the converted data information by using the data storage component.
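The following is a minimal JPA entity sketch of the population association set model described in the second step above (primary key, creation time, update time, name, identification number, data source primary key and set of associated tables); the class, table and field names are illustrative assumptions, not the actual schema of the platform.

    import javax.persistence.*;
    import java.util.Date;
    import java.util.List;

    // Illustrative JPA entity sketch of the population association set model;
    // names and types are assumptions for illustration only.
    @Entity
    @Table(name = "population_association_set")
    public class PopulationAssociation {
        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        private Long id;                        // primary key

        @Temporal(TemporalType.TIMESTAMP)
        private Date createTime;                // creation time

        @Temporal(TemporalType.TIMESTAMP)
        private Date updateTime;                // update time

        private String name;                    // person name
        private String idCardNumber;            // identification number
        private String sourcePrimaryKey;        // primary key in the original data source

        @ElementCollection
        private List<String> associatedTables;  // set of associated tables

        // getters and setters omitted for brevity
    }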
Example 5
On the basis of embodiment 3 or 4, the present embodiment provides a specific method of data cleansing and data transformation; the data cleaning process "cleans up" data by filling in missing values, smoothing noisy data, identifying or deleting outliers and resolving inconsistencies, mainly to achieve the goals of format standardization, abnormal data clean-up, error correction, and duplicate data clean-up. Data transformation uses normalization techniques to normalize numerical variables of data in order to normalize the degree of influence of each variable on the result.
In this embodiment, data cleansing includes missing data processing, identifying error classifications, center and spread metrics, identifying outliers. The present embodiment specifically describes the working principle of missing data processing.
Missing data is a problem that continues to trouble data analysis methods; even as analysis methods become more sophisticated, missing field values are still encountered, particularly in databases with large numbers of fields. Missing information is extremely detrimental to data analysis, since under otherwise equal conditions more information is generally better. Therefore, this embodiment processes missing values by choosing replacement values. As shown in fig. 4, the method specifically includes the following steps:
s301, when a recorded value is missing, replacing the missing value by inserting the mean value or by inserting classification information in proportion;
s302, when the current record is associated with data in a plurality of tables, counting the associated table data to obtain the associated data types, the data volume and the percentage of each classification, calculating the mean range of the missing data, and inserting the missing value according to the mean range, the percentages and the time variable of the missing data.
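A minimal sketch of step S301 follows: mean insertion for a numeric field and proportional insertion of classification information for a categorical field. The class and method names are assumptions made for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Illustrative sketch of S301: mean imputation for numeric values and
    // proportional category insertion for classification values.
    public class MissingValueSketch {

        // Replace nulls in a numeric column with the mean of the observed values.
        public static List<Double> fillWithMean(List<Double> column) {
            double sum = 0;
            int count = 0;
            for (Double v : column) {
                if (v != null) { sum += v; count++; }
            }
            double mean = count > 0 ? sum / count : 0.0;
            List<Double> filled = new ArrayList<>();
            for (Double v : column) {
                filled.add(v != null ? v : mean);
            }
            return filled;
        }

        // Replace nulls in a categorical column by drawing existing categories at
        // random, which inserts them in proportion to their observed frequencies.
        public static List<String> fillByProportion(List<String> column, Random random) {
            List<String> observed = new ArrayList<>();
            for (String v : column) {
                if (v != null) observed.add(v);
            }
            List<String> filled = new ArrayList<>();
            for (String v : column) {
                filled.add(v != null || observed.isEmpty()
                        ? v : observed.get(random.nextInt(observed.size())));
            }
            return filled;
        }
    }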
The beneficial effect of this embodiment is as follows: missing data processing can handle the missing-value problems common in basic data and improve the overall regularity of the collected data.
Example 6
On the basis of embodiment 5, the present embodiment provides a specific step of identifying the error classification.
Most problems with bad models stem from mishandling categorical variables. Three typical problems arise with categorical variables: too many classification levels; classification levels containing rare values; and classification levels that account for a disproportionately large share of the overall data. This embodiment therefore addresses these problems by identifying error classifications. Specifically, as shown in fig. 5, identifying error classifications includes the following steps:
s401, identifying classification information through the classificationUtil classification identification method and the classification table or classification dictionary data identification method;
s402, counting, for the data samples in each data table, how often each classification is referenced, namely obtaining the reference information of each classification within the overall data;
and S403, merging a classification into the minority classification information when its references account for less than 1% of the total reference data of all classifications.
As shown in fig. 6, the classificationUtil classification and identification method includes the following steps:
s501, acquiring all data information of a sample table;
s502, counting the data in each row with a hashMap of key-value pairs, incrementing the count in the hashMap set each time a value recurs, so that the repetition count of each row value is obtained after all data have been processed;
and S503, when the repetition count is below 30, extracting the classification data contained in the high-frequency short data samples and normalizing it into classification information.
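A minimal sketch of the counting in steps S502 and S503 follows. Reading "the repetition count is below 30" as "the column has fewer than 30 distinct values" is an interpretation assumed for this sketch, as is the class name.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of classificationUtil-style counting: each value of a
    // column is counted in a hashMap (S502); a column with only a small set of
    // distinct, frequently repeated values is treated as classification data (S503).
    public class ClassificationUtilSketch {

        // S502: key = field value, value = number of times it occurs.
        public static Map<String, Integer> countRepetitions(List<String> column) {
            Map<String, Integer> counts = new HashMap<>();
            for (String value : column) {
                counts.merge(value, 1, Integer::sum); // add one each time the value recurs
            }
            return counts;
        }

        // S503 (assumed reading): fewer than 30 distinct values suggests a classification
        // field whose values can be normalized into classification information.
        public static boolean looksLikeClassification(Map<String, Integer> counts) {
            return !counts.isEmpty() && counts.size() < 30;
        }
    }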
As shown in fig. 7, the method for recognizing classification table or classification dictionary data includes the following steps:
s601, acquiring all data of a classification table or a dictionary, and identifying classification names, classification numbers, classification codes and paths in the data;
s602, obtaining the repeated data references according to the classificationUtil classification identification method, and resolving the classification numbers or codes and their higher-level classification numbers or codes from the data counts;
s603, for a single data dictionary or classification table in which a single upper-level code corresponds to a plurality of lower-level codes and the coded data is frequently referenced in other data tables, resolving the coded data into single-type classification information or a classification table.
The beneficial effect of this embodiment is as follows: minority classification samples can be removed or merged, and the overall trend of the data can be grasped from the perspective of data integrity; the minority classification information is treated as a special classification or small data type, several minority classifications can be merged as a whole into an "other" category, and the "other" category of each data set is summarized in the constructed data model, so that more accurate results can be provided for macroscopic or overall data analysis.
Example 7
On the basis of embodiment 5 or embodiment 6, this embodiment provides specific steps for identifying outliers.
Outliers are extreme values that deviate markedly from the other values. Identifying outliers is important because they may represent data entry errors. Furthermore, some statistical methods are sensitive to the presence of outliers and may produce unreliable results even when the outliers are valid data points rather than errors. In this embodiment, outliers are identified by generating a statistical histogram from the existing data information and comparing it against the data to find values that occur rarely and differ greatly from the other values. For example, in the collection of village cloud resident archive data, records with an age greater than 120 years have few data samples according to the outlier algorithm; an auxiliary calculation based on the identification card number can then determine whether such a record really is an outlier, providing a basis for judging the data in subsequent data restoration.
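A minimal sketch of the identification card number auxiliary calculation mentioned above follows: the birth date encoded in digits 7 to 14 of an 18-digit resident identification number is used to recompute the age of a record whose stated age exceeds 120 years. The 120-year threshold mirrors the example above, while the decision rule and class name are assumptions made for illustration.

    import java.time.LocalDate;
    import java.time.Period;
    import java.time.format.DateTimeFormatter;

    // Illustrative sketch of the outlier cross-check: recompute the age from the
    // yyyyMMdd birth date embedded in an 18-digit identification number and compare
    // it with the recorded age.
    public class AgeOutlierSketch {

        public static int ageFromIdNumber(String idNumber18) {
            String birth = idNumber18.substring(6, 14);  // digits 7-14: yyyyMMdd
            LocalDate birthDate = LocalDate.parse(birth, DateTimeFormatter.BASIC_ISO_DATE);
            return Period.between(birthDate, LocalDate.now()).getYears();
        }

        public static boolean isLikelyOutlier(int recordedAge, String idNumber18) {
            int derivedAge = ageFromIdNumber(idNumber18);
            // Flag the record when the stated age is implausible and disagrees
            // with the age derived from the identification number.
            return recordedAge > 120 && Math.abs(recordedAge - derivedAge) > 1;
        }
    }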
The beneficial effect of this embodiment is as follows: by identifying outliers, a basis for judging the data is provided for subsequent data restoration.
Example 8
The present embodiment provides specific steps for the center and spread metrics based on any of embodiments 5 through 7.
As shown in fig. 8, the center and spread metrics include the following steps:
s701, acquiring all data information of the table, and calculating the data value range of the field type from the numeric data it contains;
s702, within the data value range, counting the number of occurrences of each distinct value with a hashMap object, and computing the repetition count of each data segment from the total number of values acquired;
s703, obtaining the main central point and measurement information, namely the mean and the median, from the repetition counts of the segmented data;
and S704, taking the mean and the median as references, counting the data samples of the same type, and inferring the user's points of interest from the time, mean and data range.
The center metric is a special case of the position metric, which is a numerical summary that indicates the position of certain specific variables on the numerical axis. Examples of location metrics are percentile and quantile. The mean of the variables is the average of the significant values taken by the variables. A simple way to find the mean is to add all field values and divide by the sample size.
For variables that are not extremely skewed, the mean is generally not far from the center of the variable. For extremely skewed data sets, however, the mean cannot represent the center of the variable. In addition, the mean is extremely sensitive to the presence of outliers. For this reason, other measures of center are used in data analysis, such as the median, defined as the middle field value when the variable's values are sorted in ascending order; the median is resistant to the presence of outliers. Another measure is the mode, the field value with the highest frequency of occurrence. The mode may be used for numeric or categorical data, but it does not always coincide with the center of the variable.
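A minimal sketch computing the three measures of center discussed above, the mean, the median and the mode, is given below; the class name is an assumption.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: mean, median and mode of a numeric field.
    public class CenterMetricsSketch {

        public static double mean(List<Double> values) {
            double sum = 0;
            for (double v : values) sum += v;
            return sum / values.size();
        }

        public static double median(List<Double> values) {
            List<Double> sorted = new ArrayList<>(values);
            Collections.sort(sorted);
            int n = sorted.size();
            return n % 2 == 1 ? sorted.get(n / 2)
                              : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
        }

        public static double mode(List<Double> values) {
            Map<Double, Integer> counts = new HashMap<>();
            for (double v : values) counts.merge(v, 1, Integer::sum);
            // The field value with the highest frequency of occurrence.
            return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
    }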
The beneficial effect of this embodiment is as follows: when a user selects a certain type of data as the reference core sample data, a data range can be set around the mean or the median; combining this range with time and other associated data information, a preliminary data mining algorithm can be produced, with which the user can predict or calculate future data information.
Example 9
On the basis of any of embodiments 5 to 7, the present embodiment provides specific contents of data transformation. In this embodiment, the data transformation includes min-max normalization and Z-score normalization.
The min-max normalization works as follows: the difference between the field value and the minimum value is taken and scaled by the range, i.e. the maximum minus the minimum;
the Z-score normalization works as follows: the difference between the field value and the field mean is taken and scaled by the standard deviation SD of the field values.
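A minimal sketch of the two transformations described above follows, using the standard formulas (x - min) / (max - min) for min-max normalization and (x - mean) / SD for Z-score normalization; whether the population or the sample standard deviation is used is an implementation choice noted in the comments.

    // Illustrative sketch of the two data transformations described above.
    public class NormalizationSketch {

        // min-max normalization: (x - min) / (max - min), mapping values into [0, 1].
        public static double[] minMax(double[] x) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
            double[] out = new double[x.length];
            for (int i = 0; i < x.length; i++) out[i] = (x[i] - min) / (max - min);
            return out;
        }

        // Z-score normalization: (x - mean) / SD.
        public static double[] zScore(double[] x) {
            double mean = 0;
            for (double v : x) mean += v;
            mean /= x.length;
            double variance = 0;
            for (double v : x) variance += (v - mean) * (v - mean);
            double sd = Math.sqrt(variance / x.length);  // population SD; sample SD would divide by n - 1
            double[] out = new double[x.length];
            for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / sd;
            return out;
        }
    }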
The beneficial effect of this embodiment is as follows: min-max normalization and Z-score normalization standardize the ranges of different variables and reduce the adverse effect of range differences on the mining results.
Example 10
On the basis of embodiment 9, this embodiment provides a data mining method, and in the data mining process, a description algorithm, an estimation and prediction statistical algorithm, a classification algorithm, a clustering algorithm, and an association rule are used.
In a descriptive task, some method of analyzing data is needed to describe the underlying patterns and trends of the data. The description of patterns and trends generally presents possible explanations for these patterns and trends, as well as suggestions of policy changes that may occur. In this embodiment, a description algorithm is used to explore the potential patterns and trends of the data. Specifically, the description algorithm analyzes potential patterns and trends in data through an exploratory data analysis method, a sample proportion or a regression equation;
the classification algorithm finds out common characteristics of a group of data objects in the database and divides the data objects into different classes according to a classification mode; the goal is to map data items in the database into a given category by a classification model.
Clustering is similar to classification, but its purpose is different: a clustering algorithm groups a set of data into several categories according to the similarities and differences of the data. The similarity between data belonging to the same class is large, while the similarity between data of different classes is small, and cross-class data relevance is low.
Association rules are associations or interrelationships hidden between data items, i.e. the occurrence of one data item can be deduced from the occurrence of other data items. The mining process for association rules mainly comprises two stages: the first stage finds all high-frequency itemsets in the mass of original data; the second stage generates association rules from those high-frequency itemsets.
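A minimal sketch of the two-stage process just described follows: first find frequent (high-frequency) itemsets by counting support, then form rules and keep those above a confidence threshold. It is restricted to item pairs for brevity, and the thresholds, class name and printed output are illustrative assumptions.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch of the two-stage association rule process, restricted to pairs:
    // stage one keeps item pairs whose support reaches the threshold, stage two keeps
    // rules A -> B whose confidence reaches the threshold.
    public class AssociationRuleSketch {

        public static void mineRules(List<Set<String>> transactions,
                                     double minSupport, double minConfidence) {
            Map<String, Integer> itemCount = new HashMap<>();
            Map<String, Integer> pairCount = new HashMap<>();
            for (Set<String> t : transactions) {
                for (String a : t) {
                    itemCount.merge(a, 1, Integer::sum);
                    for (String b : t) {
                        if (a.compareTo(b) < 0) {
                            pairCount.merge(a + "|" + b, 1, Integer::sum);
                        }
                    }
                }
            }
            int n = transactions.size();
            for (Map.Entry<String, Integer> e : pairCount.entrySet()) {
                double support = (double) e.getValue() / n;
                if (support < minSupport) continue;            // stage one: keep frequent pairs only
                String[] items = e.getKey().split("\\|");
                double confAtoB = (double) e.getValue() / itemCount.get(items[0]);
                double confBtoA = (double) e.getValue() / itemCount.get(items[1]);
                if (confAtoB >= minConfidence)                 // stage two: generate rules
                    System.out.printf("%s -> %s (support %.2f, confidence %.2f)%n",
                            items[0], items[1], support, confAtoB);
                if (confBtoA >= minConfidence)
                    System.out.printf("%s -> %s (support %.2f, confidence %.2f)%n",
                            items[1], items[0], support, confBtoA);
            }
        }
    }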
The description algorithm, classification algorithm, clustering algorithm and association rules complete the understanding and preparation of the data, and exploratory data analysis collects descriptive information about the data. The next step is to execute the statistical algorithms for estimation and prediction. These analyze a single variable with univariate methods of statistical estimation and prediction, each including point estimation and confidence interval estimation for population means and proportions.
For the basic data of a big data platform, the main data sources are the data information of different basic-level platform systems and Excel document data, so the data sources are relatively homogeneous. This embodiment therefore uses statistical inference methods to estimate and predict the overall data situation. Statistical inference estimates and tests hypotheses about the characteristics of the population based on the information contained in a sample. Here the population is the set of all elements of interest in a particular study, which contains people, things and data; the sample is only a subset of the population and should be representative of it. If the sample is not representative, i.e. the sample characteristics systematically deviate from the population characteristics, statistical inference should not be used.
The main contents of statistical inference fall into two broad categories: first, estimation problems (point and interval estimation); second, hypothesis testing. The following mainly explains point estimation, interval estimation and hypothesis testing of the population parameters of the data.
Point estimation is the most direct and simple non-parametric estimation method in statistical inference: based on the law of large numbers, sample statistics are used directly in place of the corresponding population indicators. Because many different statistics are available, and because sample statistics necessarily differ from the corresponding population indicators, the properties of the sample statistics must be evaluated so that good statistics are chosen for statistical inference.
Interval estimation is a form of parameter estimation. By sampling from the population and working to given accuracy and precision requirements, an appropriate interval is constructed as an estimate of the range within which the true value of a population parameter, or a function of that parameter, lies. The possible range of the population parameter is expressed as an interval on the numerical axis, referred to as the confidence interval of the interval estimate.
Hypothesis testing is a procedure for supporting or rejecting statistical hypotheses about the studied characteristics of certain objects, phenomena and processes. A statistical hypothesis is an assumption about properties of the population that can be examined on the basis of sample observations; the hypothesis under test concerns statistical relationships and the distributions of characteristic values. For example, the hypothesis that a set of characteristic values is normally distributed is a statistical hypothesis; in social studies, hypotheses of identity between two characteristic distributions, of equal means and variances, or that a certain object belongs to a certain population are often tested. The statistical hypothesis testing process statistically establishes whether the proposed hypothesis holds.
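A minimal sketch of interval estimation as described above follows: a large-sample 95% confidence interval for a population mean, computed as the point estimate plus or minus 1.96 times the standard error; the confidence level, the normal approximation and the class name are illustrative choices.

    // Illustrative sketch of interval estimation: a large-sample 95% confidence
    // interval for the population mean, x-bar +/- z * s / sqrt(n), with z = 1.96.
    public class IntervalEstimationSketch {

        public static double[] confidenceInterval95(double[] sample) {
            int n = sample.length;
            double mean = 0;
            for (double v : sample) mean += v;
            mean /= n;                                   // point estimate of the population mean
            double variance = 0;
            for (double v : sample) variance += (v - mean) * (v - mean);
            double s = Math.sqrt(variance / (n - 1));    // sample standard deviation
            double margin = 1.96 * s / Math.sqrt(n);     // 95% margin of error (normal approximation)
            return new double[] { mean - margin, mean + margin };
        }
    }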
The beneficial effect of this embodiment is as follows: data understanding and data preparation are completed, and descriptive information is gathered through exploratory data analysis.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A big data platform BI analysis method is characterized in that: the method comprises the following steps:
s1, constructing a data extraction component, and extracting data in the database or the text document by using the data extraction component;
s2, analyzing the target and the requirement according to the collected data, and creating a data model table structure;
s3, carrying out data cleaning and data transformation processing on the collected data, wherein the data cleaning comprises missing data processing, error classification identification, center and scatter measurement and outlier identification;
s4, describing potential patterns and trends of data through description algorithms, classifying the data by using classification algorithms, classifying a group of data into a plurality of categories according to the similarity and difference of the data by using clustering algorithms, generating association or correlation among data items by using association rules, and estimating and predicting by using statistical algorithms for estimation and prediction.
2. The big data platform BI analysis method of claim 1, wherein: the data extraction component in the S1 comprises a general relational database extraction component and a general text extraction component;
the construction method of the general relational database extraction component comprises the following steps:
s101, constructing a database connection configuration object;
s102, constructing a database connection operation interface;
s103, constructing a database basic information query interface;
s104, querying data information by using a JPA technology;
the construction method of the universal text extraction component comprises the following steps:
s201, constructing a text connection configuration object;
s202, constructing a text connection operation interface;
s203, constructing a text information reading interface.
3. The big data platform BI analysis method of claim 1, wherein: the missing data processing in S3 specifically includes the following steps:
s301, replacing a missing value by inserting a mean value or inserting classification information according to proportion under the condition of missing recorded data;
s302, when the current record is associated with a plurality of table data, the associated table data information is counted, the associated data type, the data amount and the percentage of each classification in the table data information are obtained, the mean range of the missing data is calculated, and the missing value is inserted according to the mean range, the percentage and the time variable of the missing data.
4. The big data platform BI analysis method of claim 3, wherein: the identifying the misclassification comprises the steps of:
s401, identifying classification information through a classificationUtil classification identification method, a classification table or a classification dictionary data identification method;
s402, respectively calculating the quoting condition of each category of information in the data according to the quoting classification information of the data samples in each data table, namely obtaining the quoting information of each classification in the whole data;
and S403, classifying the total data of all the data samples into a small number of classification information under the condition that the total data of all the classification reference data is lower than 1%.
5. The big data platform BI analysis method of claim 4, wherein: the classificationUtil classification and identification method comprises the following steps:
s501, acquiring all data information of a sample table;
s502, calculating data in each line of data in a hashMap key value pair mode, and automatically counting and adding one in a hashMap set, so that a repeated value of each line of data is obtained after all data are calculated;
and S503, when the repetition value is lower than 30, acquiring the classification data information in the high-frequency short data sample and normalizing the classification data information into classification information.
6. The big data platform BI analysis method of claim 5, wherein: the classification table or classification dictionary data identification method comprises the following steps:
s601, acquiring all data of a classification table or a dictionary, and identifying classification names, classification numbers, classification codes and paths in the data;
s602, obtaining repeated data citation according to a classificationUtil classification identification method, and analyzing a classification number or a classification code and a higher classification number or a classification code according to a data counting value;
s603, for a single data dictionary or classification table, a single upper-level code corresponds to a plurality of lower-level codes, and the coded data is frequently referred in other data tables, namely, the coded data can be analyzed into single-type classification information or classification tables.
7. The big data platform BI analysis method of claim 4, wherein: the center and spread metrics include the steps of:
s701, acquiring all data information of the form, and calculating a data value range of the field type according to digital type data in the data information;
s702, acquiring the repeated occurrence times of different data by using a hashMap object according to the data value range and the repeated occurrence times of different data, and calculating the repeated times of each segment of data in a segmented manner according to the total number of the acquired values;
s703, acquiring a main central point and measurement information, namely an average value number and a median, according to the repetition times of the segmented data;
and S704, deducing the user interest points according to the conditions of time, mean value and data range when the mean value number and the median are taken as standards and data sample data of the same type are counted.
8. The big data platform BI analysis method of claim 7, wherein: the data transformation in the S3 comprises min-max normalization and Z-score normalization;
the min-max normalized working mode is as follows: observing a difference between the field value and the minimum value and scaling the difference by a range;
the working mode of the Z-score standardization is as follows: the difference between the field value and the field mean is captured and scaled by the standard deviation SD of the field value.
9. The big data platform BI analysis method of claim 1, wherein: the statistical algorithms for estimation and prediction in S4 include a point estimation method, an interval estimation method, and a hypothesis verification method.
CN201911066534.3A 2019-11-04 2019-11-04 Big data platform BI analysis method Active CN110990384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911066534.3A CN110990384B (en) 2019-11-04 2019-11-04 Big data platform BI analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911066534.3A CN110990384B (en) 2019-11-04 2019-11-04 Big data platform BI analysis method

Publications (2)

Publication Number Publication Date
CN110990384A true CN110990384A (en) 2020-04-10
CN110990384B CN110990384B (en) 2023-08-22

Family

ID=70082982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911066534.3A Active CN110990384B (en) 2019-11-04 2019-11-04 Big data platform BI analysis method

Country Status (1)

Country Link
CN (1) CN110990384B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060136462A1 (en) * 2004-12-16 2006-06-22 Campos Marcos M Data-centric automatic data mining
CN102135979A (en) * 2010-12-08 2011-07-27 华为技术有限公司 Data cleaning method and device
CN103761311A (en) * 2014-01-23 2014-04-30 中国矿业大学 Sentiment classification method based on multi-source field instance migration
US20180189664A1 (en) * 2015-06-26 2018-07-05 National University Of Ireland, Galway Data analysis and event detection method and system
US20170192975A1 (en) * 2015-12-31 2017-07-06 Ebay Inc. System and method for identifying miscategorization
CN106022477A (en) * 2016-05-18 2016-10-12 国网信通亿力科技有限责任公司 Intelligent analysis decision system and method
CN106776703A (en) * 2016-11-15 2017-05-31 上海汉邦京泰数码技术有限公司 A kind of multivariate data cleaning technique under virtualized environment
US20180218070A1 (en) * 2017-02-01 2018-08-02 Wipro Limited System and method of data cleansing for improved data classification
CN107103050A (en) * 2017-03-31 2017-08-29 海通安恒(大连)大数据科技有限公司 A kind of big data Modeling Platform and method
US20180322096A1 (en) * 2017-05-05 2018-11-08 Han-Wei Zhang Data analysis system and analysis method therefor
CN109739850A (en) * 2019-01-11 2019-05-10 安徽爱吉泰克科技有限公司 A kind of archives big data intellectual analysis cleaning digging system
CN110120040A (en) * 2019-05-13 2019-08-13 广州锟元方青医疗科技有限公司 Sectioning image processing method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
殷复莲, pages: 32 - 34 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579581A (en) * 2020-11-30 2021-03-30 贵州力创科技发展有限公司 Data access method and system of data analysis engine
CN112579581B (en) * 2020-11-30 2023-04-14 贵州力创科技发展有限公司 Data access method and system of data analysis engine
CN114466393A (en) * 2022-04-13 2022-05-10 深圳市永达电子信息股份有限公司 Rail transit vehicle-ground communication potential risk monitoring method and system
CN114466393B (en) * 2022-04-13 2022-07-12 深圳市永达电子信息股份有限公司 Rail transit vehicle-ground communication potential risk monitoring method and system

Also Published As

Publication number Publication date
CN110990384B (en) 2023-08-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant