CN110990384A - Big data platform BI analysis method - Google Patents

Big data platform BI analysis method

Info

Publication number
CN110990384A
CN110990384A (application CN201911066534.3A)
Authority
CN
China
Prior art keywords
data
classification
information
value
platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911066534.3A
Other languages
Chinese (zh)
Other versions
CN110990384B (en)
Inventor
闻小明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Sinocare Technology Co ltd
Original Assignee
Wuhan Sinocare Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Sinocare Technology Co ltd filed Critical Wuhan Sinocare Technology Co ltd
Priority to CN201911066534.3A priority Critical patent/CN110990384B/en
Publication of CN110990384A publication Critical patent/CN110990384A/en
Application granted granted Critical
Publication of CN110990384B publication Critical patent/CN110990384B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2462 Approximate or statistical queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a BI analysis method for a big data platform. By adding error-classification identification to data cleaning, minority classification samples can be removed or merged, and the overall trend of the data can be grasped from the perspective of data integrity; the minority classification information is treated as a special classification or small data type, several minority classifications can be merged as a whole into an "other" category, and the "other" category of each data set is summarized in the constructed data model, so that more accurate results can be provided for macroscopic or overall data analysis. By adding center and spread metrics to data cleaning, when a user selects a certain type of data as the reference core sample data, a data range can be set around the mean or the median; combining this range with time and other associated data information, a preliminary data mining algorithm can be produced, with which the user can estimate or project future data.

Description

Big data platform BI analysis method
Technical Field
The invention relates to the field of big data, in particular to a BI analysis method for a big data platform.
Background
BI analysis is the process of loading business system data into a data warehouse after extraction, cleaning and transformation. As information technology develops, data quality has remained a central concern throughout the data mining process. Data cleaning is the most resource-consuming step in data mining, and how to effectively clean data and convert it into a data source that meets the mining requirements is a key factor affecting mining accuracy. Existing data cleaning methods include missing value processing, abnormal value processing, de-duplication and noise data processing; these methods can screen out duplicated and redundant data, complete missing data, correct or delete erroneous data, and finally organize the data into a form that can be further processed and used. However, where data quality requirements are high, traditional data cleaning methods cannot meet them, so the decision support given by the BI analysis system is not accurate. The invention therefore provides a BI analysis method for a big data platform that improves the quality of data cleaning and thus provides more accurate analysis results.
Disclosure of Invention
In view of this, the invention provides a big data platform BI analysis method, which can improve the data quality of data cleaning and further provide a more accurate analysis result.
The technical scheme of the invention is realized as follows: the invention provides a BI analysis method for a big data platform, which comprises the following steps:
s1, constructing a data extraction component, and extracting data in the database or the text document by using the data extraction component;
s2, analyzing the target and the requirement according to the collected data, and creating a data model table structure;
s3, carrying out data cleaning and data transformation processing on the collected data, wherein the data cleaning comprises missing data processing, error classification identification, center and scatter measurement and outlier identification;
s4, describing potential patterns and trends in the data with description algorithms, classifying the data with classification algorithms, grouping the data into several categories according to their similarities and differences with clustering algorithms, deriving associations or correlations among data items with association rules, and performing estimation and prediction with statistical algorithms for estimation and prediction.
On the basis of the above technical solution, preferably, the data extraction component in S1 includes a general relational database extraction component and a general text extraction component;
the construction method of the general relational database extraction component comprises the following steps:
s101, constructing a database connection configuration object;
s102, constructing a database connection operation interface;
s103, constructing a database basic information query interface;
s104, querying data information by using a JPA technology;
the construction method of the universal text extraction component comprises the following steps:
s201, constructing a text connection configuration object;
s202, constructing a text connection operation interface;
s203, constructing a text information reading interface.
On the basis of the above technical solution, preferably, the missing data processing in S3 specifically includes the following steps:
s301, when a recorded value is missing, replacing the missing value by inserting the mean value or by inserting classification information in proportion;
s302, when the current record is associated with data in a plurality of tables, counting the associated table data to obtain the associated data types, the data volume and the percentage of each classification, calculating the mean range of the missing data, and inserting the missing value according to the mean range, the percentages and the time variable of the missing data.
Further preferably, identifying the misclassification comprises the following steps:
s401, identifying classification information through the classificationUtil classification identification method and the classification table or classification dictionary data identification method;
s402, counting, for the data samples in each data table, how often each classification is referenced, namely obtaining the reference information of each classification within the overall data;
and S403, merging a classification into the minority classification information when its references account for less than 1% of the total reference data of all classifications.
Further preferably, the classificationUtil classification and identification method includes the following steps:
s501, acquiring all data information of a sample table;
s502, counting the data in each row with a hashMap of key-value pairs, incrementing the count in the hashMap set each time a value recurs, so that the repetition count of each row value is obtained after all data have been processed;
and S503, when the repetition count is below 30, extracting the classification data contained in the high-frequency short data samples and normalizing it into classification information.
Further preferably, the classification table or classification dictionary data identification method includes the steps of:
s601, acquiring all data of a classification table or a dictionary, and identifying classification names, classification numbers, classification codes and paths in the data;
s602, obtaining the repeated data references according to the classificationUtil classification identification method, and resolving the classification numbers or codes and their higher-level classification numbers or codes from the data counts;
s603, for a single data dictionary or classification table in which a single upper-level code corresponds to a plurality of lower-level codes and the coded data is frequently referenced in other data tables, resolving the coded data into single-type classification information or a classification table.
Further preferably, the center and spread metrics comprise the steps of:
s701, acquiring all data information of the table, and calculating the data value range of the field type from the numeric data it contains;
s702, within the data value range, counting the number of occurrences of each distinct value with a hashMap object, and computing the repetition count of each data segment from the total number of values acquired;
s703, obtaining the main central point and measurement information, namely the mean and the median, from the repetition counts of the segmented data;
and S704, taking the mean and the median as references, counting the data samples of the same type, and inferring the user's points of interest from the time, mean and data range.
Further preferably, the data transformation in S3 includes min-max normalization and Z-score normalization;
the min-max normalization works as follows: the difference between the field value and the minimum value is taken and scaled by the range, i.e. the maximum minus the minimum;
the Z-score normalization works as follows: the difference between the field value and the field mean is taken and scaled by the standard deviation SD of the field values.
On the basis of the above technical solution, preferably, the statistical algorithms for estimation and prediction in S4 include a point estimation method, an interval estimation method and a hypothesis testing method.
Compared with the prior art, the big data platform BI analysis method has the following beneficial effects:
(1) by adding error-classification identification to data cleaning, minority classification samples can be removed or merged, and the overall trend of the data can be grasped from the perspective of data integrity; the minority classification information is treated as a special classification or small data type, several minority classifications can be merged as a whole into an "other" category, and the "other" category of each data set is summarized in the constructed data model, so that more accurate results can be provided for macroscopic or overall data analysis;
(2) by adding center and spread metrics to data cleaning, when a user selects a certain type of data as the reference core sample data, a data range can be set around the mean or the median; combining this range with time and other associated data information, a preliminary data mining algorithm can be produced, with which the user can estimate or project future data;
(3) because different government affairs platforms handle only their own main business, each platform accumulates different business data according to its business characteristics, resulting in a large number of Excel documents and business data. Through the data extraction component, data configuration information can be constructed and connected to a database or text file, so that data in databases and in Excel documents can be extracted, the business data of each platform can be integrated, and data silos can be broken down.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a big data platform BI analysis method of the present invention;
FIG. 2 is a flow chart of a method for constructing a generic relational database extraction component in a big data platform BI analysis method of the present invention;
FIG. 3 is a flow chart of a method for constructing a universal text extraction component in a big data platform BI analysis method according to the present invention;
FIG. 4 is a flow chart of missing data processing in a big data platform BI analysis method of the present invention;
FIG. 5 is a flow chart of the identification of error classifications in a big data platform BI analysis method of the present invention;
FIG. 6 is a flow chart of the classificationUtil classification identification method of FIG. 5;
FIG. 7 is a flow chart of a method of identifying classification table or classification dictionary data in FIG. 5;
FIG. 8 is a flow chart of the hub and scatter metrics in a big data platform BI analysis method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in FIG. 1, the big data platform BI analysis method of the present invention comprises the following steps:
s1, constructing a data extraction component, and extracting data in the database or the text document by using the data extraction component;
s2, analyzing the target and the requirement according to the collected data, and creating a data model table structure;
s3, carrying out data cleaning and data transformation processing on the collected data, wherein the data cleaning comprises missing data processing, error classification identification, center and scatter measurement and outlier identification;
s4, describing potential patterns and trends in the data with description algorithms, classifying the data with classification algorithms, grouping the data into several categories according to their similarities and differences with clustering algorithms, deriving associations or correlations among data items with association rules, and performing estimation and prediction with statistical algorithms for estimation and prediction.
The beneficial effects of this embodiment are as follows: identifying error classifications effectively improves the accuracy of data mining; minority classification samples are removed or merged, and the overall trend of the data is grasped from the perspective of data integrity; in addition, this helps the user eliminate the influence of a small number of data samples on the overall data trend when the data model is created;
the mean, median and mode can be obtained through the center and spread metrics; the mean is an important part of data mining, and corresponding data mining algorithms can be drawn up according to the user's points of interest and the data mining direction; different algorithms need to refer to different means within the data set in order to estimate or analyze the desired data, and the same applies to the median and the mode;
potential patterns and trends in the data can be described by the description algorithm; the classification algorithm divides the data in the database into different classes according to common characteristics; the clustering algorithm divides a group of data into several classes according to their similarities and differences, so that the similarity between data in the same class is large, the similarity between data in different classes is small, and cross-class data relevance is low; an association rule allows the occurrence of other data items to be deduced from the occurrence of one data item; and the statistical algorithm for estimation and prediction discovers the relationships and rules present in the data and predicts future trends from the existing data.
Example 2
The embodiment provides a data extraction mode on the basis of embodiment 1. The data extraction component is used to extract data and to interface the data with the system. Data interfacing requires the ability to read Excel table data and to extract data from traditional relational databases such as MySQL, Oracle and SQLServer. With this capability, the data information on which the analysis method is based can be extracted, providing a data basis for the next step.
The data extraction component comprises a general relational database extraction component and a general text extraction component.
As shown in fig. 2, the method for constructing the general relational database extraction component includes the following steps:
s101, constructing a database connection configuration object;
s102, constructing a database connection operation interface;
s103, constructing a database basic information query interface;
s104, querying data information by using a JPA technology;
as shown in fig. 3, the method for constructing the universal text extraction component includes the following steps:
s201, constructing a text connection configuration object;
s202, constructing a text connection operation interface;
s203, constructing a text information reading interface.
Taking a village service cloud platform as an example, the platform's data information is read: a relational database docking component is built with JPA (Java Persistence API) technology, and after the connection information for the different databases is configured, the business data in the databases can be read directly. Excel files are read with Apache's POI component, extended for this file-reading purpose.
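The following is a minimal sketch, not the patented component itself, of how an Excel document might be read with Apache POI in the spirit of the universal text extraction component; the class name and the choice of reading only the first sheet are illustrative assumptions.

    import org.apache.poi.ss.usermodel.*;
    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: read every cell of the first sheet of an Excel workbook
    // with Apache POI and return the rows as text values.
    public class ExcelExtractorSketch {
        public static List<List<String>> readFirstSheet(File excelFile) throws Exception {
            List<List<String>> rows = new ArrayList<>();
            try (Workbook workbook = WorkbookFactory.create(excelFile)) {
                Sheet sheet = workbook.getSheetAt(0);           // first sheet only, for brevity
                DataFormatter formatter = new DataFormatter();  // renders any cell type as text
                for (Row row : sheet) {
                    List<String> values = new ArrayList<>();
                    for (Cell cell : row) {
                        values.add(formatter.formatCellValue(cell));
                    }
                    rows.add(values);
                }
            }
            return rows;
        }
    }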
The beneficial effect of this embodiment is as follows: because different government affairs platforms handle only their own main business, each platform accumulates different business data according to its business characteristics, resulting in a large number of Excel documents and business data. In this embodiment, data configuration information can be constructed through the data extraction component and connected to a database or text file, so that data in databases and in Excel documents can be extracted, the business data of each platform can be integrated, and data silos can be broken down.
Example 3
On the basis of embodiment 1 or embodiment 2, this embodiment provides a data modeling method. Data are acquired with the general data extraction component, and a general storage scheme is provided for the acquired data information, i.e., data basis modeling is carried out. The collected data information is stored using a standard data model.
Data modeling is a process of an information system for defining and analyzing the requirements of data and the corresponding support it needs. Data modeling defines not only data elements, but also their structure and relationships between them.
The data modeling is realized by the following steps:
s801, analyzing a target according to user data, and combing analysis requirements;
s802, establishing a data model table structure by using a data modeling component;
s803, collecting data information by using a data extraction component;
s804, converting the acquired data information by using a data processing assembly;
and S805, saving the converted data information by using the data storage component.
The beneficial effects of this embodiment are as follows: the method can acquire the table structure information of the original data source according to different data sources and then construct the data model table structure; after construction is complete, system developers can fill the data model table structure with data, completing data integration and providing a data basis for the next steps of data processing and mining.
Example 4
On the basis of embodiment 3, the present embodiment provides an embodiment of data modeling, and the present embodiment takes a rural data platform as an example. According to the modeling method of embodiment 3, the modeling method is described in detail in this embodiment.
S801, analyzing a target according to user data, and combing analysis requirements;
specifically, data are integrated and stored, and a special subject database is built according to requirements.
S802, establishing a data model table structure by using a data modeling component; the method specifically comprises the following four steps:
the first step is to fulfill the basic data integration requirement. The requirements need to read data source original data information through a data acquisition assembly to acquire original data source table structure information.
And secondly, completing population data integration. The demand mainly integrates population data and associated data information, basic data is imported in a similar basic data integration mode, and an imported data copy model is created. The important point of attention here is that the association set model is constructed from the key information of the population data, such as the key information of name, identification number, main key of population information, etc. The association set model mainly comprises: the method comprises the steps of a main key, creation time, updating time, name, identification card number, data source main key and association table set; according to the model, the population associated data information can be effectively integrated, and support is provided for subsequent data analysis.
And thirdly, completing thematic data integration. The method mainly integrates thematic data and associated data information, basic data are imported in a similar basic data integration mode, and an imported data copy model is created. The key point of attention here is the key word information of the thematic data, such as key information of thematic names, thematic numbers, thematic data main keys, thematic time and the like, and the communication association set model.
The model mainly comprises: key, creation time, update time, topic name, topic number, topic key, topic time, and association table set. And providing associated data information for the event topic information according to the model, and simultaneously supporting subsequent data analysis services.
S803, collecting data information by using a data extraction component;
s804, converting the acquired data information by using a data processing assembly;
and S805, saving the converted data information by using the data storage component.
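The following is a minimal JPA entity sketch of the population association set model described in the second step above (primary key, creation time, update time, name, identification number, data source primary key and set of associated tables); the class, table and field names are illustrative assumptions, not the actual schema of the platform.

    import javax.persistence.*;
    import java.util.Date;
    import java.util.List;

    // Illustrative JPA entity sketch of the population association set model;
    // names and types are assumptions for illustration only.
    @Entity
    @Table(name = "population_association_set")
    public class PopulationAssociation {
        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        private Long id;                        // primary key

        @Temporal(TemporalType.TIMESTAMP)
        private Date createTime;                // creation time

        @Temporal(TemporalType.TIMESTAMP)
        private Date updateTime;                // update time

        private String name;                    // person name
        private String idCardNumber;            // identification number
        private String sourcePrimaryKey;        // primary key in the original data source

        @ElementCollection
        private List<String> associatedTables;  // set of associated tables

        // getters and setters omitted for brevity
    }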
Example 5
On the basis of embodiment 3 or 4, the present embodiment provides a specific method of data cleansing and data transformation; the data cleaning process "cleans up" data by filling in missing values, smoothing noisy data, identifying or deleting outliers and resolving inconsistencies, mainly to achieve the goals of format standardization, abnormal data clean-up, error correction, and duplicate data clean-up. Data transformation uses normalization techniques to normalize numerical variables of data in order to normalize the degree of influence of each variable on the result.
In this embodiment, data cleansing includes missing data processing, identifying error classifications, center and spread metrics, identifying outliers. The present embodiment specifically describes the working principle of missing data processing.
Missing data is a problem that continues to trouble data analysis methods; even as analysis methods become more sophisticated, missing field values are still encountered, particularly in databases with large numbers of fields. Missing information is extremely detrimental to data analysis, since under otherwise equal conditions more information is generally better. Therefore, this embodiment processes missing values by choosing replacement values. As shown in fig. 4, the method specifically includes the following steps:
s301, when a recorded value is missing, replacing the missing value by inserting the mean value or by inserting classification information in proportion;
s302, when the current record is associated with data in a plurality of tables, counting the associated table data to obtain the associated data types, the data volume and the percentage of each classification, calculating the mean range of the missing data, and inserting the missing value according to the mean range, the percentages and the time variable of the missing data.
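A minimal sketch of step S301 follows: mean insertion for a numeric field and proportional insertion of classification information for a categorical field. The class and method names are assumptions made for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Illustrative sketch of S301: mean imputation for numeric values and
    // proportional category insertion for classification values.
    public class MissingValueSketch {

        // Replace nulls in a numeric column with the mean of the observed values.
        public static List<Double> fillWithMean(List<Double> column) {
            double sum = 0;
            int count = 0;
            for (Double v : column) {
                if (v != null) { sum += v; count++; }
            }
            double mean = count > 0 ? sum / count : 0.0;
            List<Double> filled = new ArrayList<>();
            for (Double v : column) {
                filled.add(v != null ? v : mean);
            }
            return filled;
        }

        // Replace nulls in a categorical column by drawing existing categories at
        // random, which inserts them in proportion to their observed frequencies.
        public static List<String> fillByProportion(List<String> column, Random random) {
            List<String> observed = new ArrayList<>();
            for (String v : column) {
                if (v != null) observed.add(v);
            }
            List<String> filled = new ArrayList<>();
            for (String v : column) {
                filled.add(v != null || observed.isEmpty()
                        ? v : observed.get(random.nextInt(observed.size())));
            }
            return filled;
        }
    }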
The beneficial effect of this embodiment is as follows: missing data processing can handle the missing-value problems common in basic data and improve the overall regularity of the collected data.
Example 6
On the basis of embodiment 5, the present embodiment provides a specific step of identifying the error classification.
Most problems with bad models stem from mishandling categorical variables. Three typical problems arise with categorical variables: too many classification levels; classification levels containing rare values; and classification levels that account for a disproportionately large share of the overall data. This embodiment therefore addresses these problems by identifying error classifications. Specifically, as shown in fig. 5, identifying error classifications includes the following steps:
s401, identifying classification information through the classificationUtil classification identification method and the classification table or classification dictionary data identification method;
s402, counting, for the data samples in each data table, how often each classification is referenced, namely obtaining the reference information of each classification within the overall data;
and S403, merging a classification into the minority classification information when its references account for less than 1% of the total reference data of all classifications.
As shown in fig. 6, the classificationUtil classification and identification method includes the following steps:
s501, acquiring all data information of a sample table;
s502, counting the data in each row with a hashMap of key-value pairs, incrementing the count in the hashMap set each time a value recurs, so that the repetition count of each row value is obtained after all data have been processed;
and S503, when the repetition count is below 30, extracting the classification data contained in the high-frequency short data samples and normalizing it into classification information.
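A minimal sketch of the counting in steps S502 and S503 follows. Reading "the repetition count is below 30" as "the column has fewer than 30 distinct values" is an interpretation assumed for this sketch, as is the class name.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of classificationUtil-style counting: each value of a
    // column is counted in a hashMap (S502); a column with only a small set of
    // distinct, frequently repeated values is treated as classification data (S503).
    public class ClassificationUtilSketch {

        // S502: key = field value, value = number of times it occurs.
        public static Map<String, Integer> countRepetitions(List<String> column) {
            Map<String, Integer> counts = new HashMap<>();
            for (String value : column) {
                counts.merge(value, 1, Integer::sum); // add one each time the value recurs
            }
            return counts;
        }

        // S503 (assumed reading): fewer than 30 distinct values suggests a classification
        // field whose values can be normalized into classification information.
        public static boolean looksLikeClassification(Map<String, Integer> counts) {
            return !counts.isEmpty() && counts.size() < 30;
        }
    }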
As shown in fig. 7, the method for recognizing classification table or classification dictionary data includes the following steps:
s601, acquiring all data of a classification table or a dictionary, and identifying classification names, classification numbers, classification codes and paths in the data;
s602, obtaining the repeated data references according to the classificationUtil classification identification method, and resolving the classification numbers or codes and their higher-level classification numbers or codes from the data counts;
s603, for a single data dictionary or classification table in which a single upper-level code corresponds to a plurality of lower-level codes and the coded data is frequently referenced in other data tables, resolving the coded data into single-type classification information or a classification table.
The beneficial effect of this embodiment is as follows: minority classification samples can be removed or merged, and the overall trend of the data can be grasped from the perspective of data integrity; the minority classification information is treated as a special classification or small data type, several minority classifications can be merged as a whole into an "other" category, and the "other" category of each data set is summarized in the constructed data model, so that more accurate results can be provided for macroscopic or overall data analysis.
Example 7
On the basis of embodiment 5 or embodiment 6, this embodiment provides specific steps for identifying outliers.
Outliers are extreme values that deviate markedly from the other values. Identifying outliers is important because they may represent data entry errors. Furthermore, some statistical methods are sensitive to the presence of outliers and may produce unreliable results even when the outliers are valid data points rather than errors. In this embodiment, outliers are identified by generating a statistical histogram from the existing data information and comparing it against the data to find values that occur rarely and differ greatly from the other values. For example, in the collection of village cloud resident archive data, records with an age greater than 120 years have few data samples according to the outlier algorithm; an auxiliary calculation based on the identification card number can then determine whether such a record really is an outlier, providing a basis for judging the data in subsequent data restoration.
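A minimal sketch of the identification card number auxiliary calculation mentioned above follows: the birth date encoded in digits 7 to 14 of an 18-digit resident identification number is used to recompute the age of a record whose stated age exceeds 120 years. The 120-year threshold mirrors the example above, while the decision rule and class name are assumptions made for illustration.

    import java.time.LocalDate;
    import java.time.Period;
    import java.time.format.DateTimeFormatter;

    // Illustrative sketch of the outlier cross-check: recompute the age from the
    // yyyyMMdd birth date embedded in an 18-digit identification number and compare
    // it with the recorded age.
    public class AgeOutlierSketch {

        public static int ageFromIdNumber(String idNumber18) {
            String birth = idNumber18.substring(6, 14);  // digits 7-14: yyyyMMdd
            LocalDate birthDate = LocalDate.parse(birth, DateTimeFormatter.BASIC_ISO_DATE);
            return Period.between(birthDate, LocalDate.now()).getYears();
        }

        public static boolean isLikelyOutlier(int recordedAge, String idNumber18) {
            int derivedAge = ageFromIdNumber(idNumber18);
            // Flag the record when the stated age is implausible and disagrees
            // with the age derived from the identification number.
            return recordedAge > 120 && Math.abs(recordedAge - derivedAge) > 1;
        }
    }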
The beneficial effect of this embodiment is as follows: by identifying outliers, a basis for judging the data is provided for subsequent data restoration.
Example 8
The present embodiment provides specific steps for the center and spread metrics based on any of embodiments 5 through 7.
As shown in fig. 8, the center and spread metrics include the following steps:
s701, acquiring all data information of the table, and calculating the data value range of the field type from the numeric data it contains;
s702, within the data value range, counting the number of occurrences of each distinct value with a hashMap object, and computing the repetition count of each data segment from the total number of values acquired;
s703, obtaining the main central point and measurement information, namely the mean and the median, from the repetition counts of the segmented data;
and S704, taking the mean and the median as references, counting the data samples of the same type, and inferring the user's points of interest from the time, mean and data range.
The center metric is a special case of the position metric, which is a numerical summary that indicates the position of certain specific variables on the numerical axis. Examples of location metrics are percentile and quantile. The mean of the variables is the average of the significant values taken by the variables. A simple way to find the mean is to add all field values and divide by the sample size.
For variables that are not extremely skewed, the mean is generally not far from the center of the variable. For extremely skewed data sets, however, the mean cannot represent the center of the variable. In addition, the mean is extremely sensitive to the presence of outliers. For this reason, other measures of center are used in data analysis, such as the median, defined as the middle field value when the variable's values are sorted in ascending order; the median is resistant to the presence of outliers. Another measure is the mode, the field value with the highest frequency of occurrence. The mode may be used for numeric or categorical data, but it does not always coincide with the center of the variable.
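A minimal sketch computing the three measures of center discussed above, the mean, the median and the mode, is given below; the class name is an assumption.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: mean, median and mode of a numeric field.
    public class CenterMetricsSketch {

        public static double mean(List<Double> values) {
            double sum = 0;
            for (double v : values) sum += v;
            return sum / values.size();
        }

        public static double median(List<Double> values) {
            List<Double> sorted = new ArrayList<>(values);
            Collections.sort(sorted);
            int n = sorted.size();
            return n % 2 == 1 ? sorted.get(n / 2)
                              : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
        }

        public static double mode(List<Double> values) {
            Map<Double, Integer> counts = new HashMap<>();
            for (double v : values) counts.merge(v, 1, Integer::sum);
            // The field value with the highest frequency of occurrence.
            return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
    }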
The beneficial effect of this embodiment is as follows: when a user selects a certain type of data as the reference core sample data, a data range can be set around the mean or the median; combining this range with time and other associated data information, a preliminary data mining algorithm can be produced, with which the user can predict or calculate future data information.
Example 9
On the basis of any of embodiments 5 to 7, the present embodiment provides specific contents of data transformation. In this embodiment, the data transformation includes min-max normalization and Z-score normalization.
The min-max normalization works as follows: the difference between the field value and the minimum value is taken and scaled by the range, i.e. the maximum minus the minimum;
the Z-score normalization works as follows: the difference between the field value and the field mean is taken and scaled by the standard deviation SD of the field values.
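A minimal sketch of the two transformations described above follows, using the standard formulas (x - min) / (max - min) for min-max normalization and (x - mean) / SD for Z-score normalization; whether the population or the sample standard deviation is used is an implementation choice noted in the comments.

    // Illustrative sketch of the two data transformations described above.
    public class NormalizationSketch {

        // min-max normalization: (x - min) / (max - min), mapping values into [0, 1].
        public static double[] minMax(double[] x) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
            double[] out = new double[x.length];
            for (int i = 0; i < x.length; i++) out[i] = (x[i] - min) / (max - min);
            return out;
        }

        // Z-score normalization: (x - mean) / SD.
        public static double[] zScore(double[] x) {
            double mean = 0;
            for (double v : x) mean += v;
            mean /= x.length;
            double variance = 0;
            for (double v : x) variance += (v - mean) * (v - mean);
            double sd = Math.sqrt(variance / x.length);  // population SD; sample SD would divide by n - 1
            double[] out = new double[x.length];
            for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / sd;
            return out;
        }
    }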
The beneficial effect of this embodiment is as follows: min-max normalization and Z-score normalization standardize the ranges of different variables and reduce the adverse effect of range differences on the mining results.
Example 10
On the basis of embodiment 9, this embodiment provides a data mining method, and in the data mining process, a description algorithm, an estimation and prediction statistical algorithm, a classification algorithm, a clustering algorithm, and an association rule are used.
In a descriptive task, some method of analyzing data is needed to describe the underlying patterns and trends of the data. The description of patterns and trends generally presents possible explanations for these patterns and trends, as well as suggestions of policy changes that may occur. In this embodiment, a description algorithm is used to explore the potential patterns and trends of the data. Specifically, the description algorithm analyzes potential patterns and trends in data through an exploratory data analysis method, a sample proportion or a regression equation;
the classification algorithm finds out common characteristics of a group of data objects in the database and divides the data objects into different classes according to a classification mode; the goal is to map data items in the database into a given category by a classification model.
Clustering is similar to classification, but its purpose is different: a clustering algorithm groups a set of data into several categories according to the similarities and differences of the data. The similarity between data belonging to the same class is large, while the similarity between data of different classes is small, and cross-class data relevance is low.
Association rules are associations or interrelationships hidden between data items, i.e. the occurrence of one data item can be deduced from the occurrence of other data items. The mining process for association rules mainly comprises two stages: the first stage finds all high-frequency itemsets in the mass of original data; the second stage generates association rules from those high-frequency itemsets.
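A minimal sketch of the two-stage process just described follows: first find frequent (high-frequency) itemsets by counting support, then form rules and keep those above a confidence threshold. It is restricted to item pairs for brevity, and the thresholds, class name and printed output are illustrative assumptions.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch of the two-stage association rule process, restricted to pairs:
    // stage one keeps item pairs whose support reaches the threshold, stage two keeps
    // rules A -> B whose confidence reaches the threshold.
    public class AssociationRuleSketch {

        public static void mineRules(List<Set<String>> transactions,
                                     double minSupport, double minConfidence) {
            Map<String, Integer> itemCount = new HashMap<>();
            Map<String, Integer> pairCount = new HashMap<>();
            for (Set<String> t : transactions) {
                for (String a : t) {
                    itemCount.merge(a, 1, Integer::sum);
                    for (String b : t) {
                        if (a.compareTo(b) < 0) {
                            pairCount.merge(a + "|" + b, 1, Integer::sum);
                        }
                    }
                }
            }
            int n = transactions.size();
            for (Map.Entry<String, Integer> e : pairCount.entrySet()) {
                double support = (double) e.getValue() / n;
                if (support < minSupport) continue;            // stage one: keep frequent pairs only
                String[] items = e.getKey().split("\\|");
                double confAtoB = (double) e.getValue() / itemCount.get(items[0]);
                double confBtoA = (double) e.getValue() / itemCount.get(items[1]);
                if (confAtoB >= minConfidence)                 // stage two: generate rules
                    System.out.printf("%s -> %s (support %.2f, confidence %.2f)%n",
                            items[0], items[1], support, confAtoB);
                if (confBtoA >= minConfidence)
                    System.out.printf("%s -> %s (support %.2f, confidence %.2f)%n",
                            items[1], items[0], support, confBtoA);
            }
        }
    }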
The description algorithm, classification algorithm, clustering algorithm and association rules complete the understanding and preparation of the data, and exploratory data analysis collects descriptive information about the data. The next step is to execute the statistical algorithms for estimation and prediction. These analyze a single variable with univariate methods of statistical estimation and prediction, each including point estimation and confidence interval estimation for population means and proportions.
For the basic data of a big data platform, the main data sources are the data information of different basic-level platform systems and Excel document data, so the data sources are relatively homogeneous. This embodiment therefore uses statistical inference methods to estimate and predict the overall data situation. Statistical inference estimates and tests hypotheses about the characteristics of the population based on the information contained in a sample. Here the population is the set of all elements of interest in a particular study, which contains people, things and data; the sample is only a subset of the population and should be representative of it. If the sample is not representative, i.e. the sample characteristics systematically deviate from the population characteristics, statistical inference should not be used.
The main contents of statistical inference fall into two broad categories: first, estimation problems (point and interval estimation); second, hypothesis testing. The following mainly explains point estimation, interval estimation and hypothesis testing of the population parameters of the data.
Point estimation is the most direct and simple non-parametric estimation method in statistical inference: based on the law of large numbers, sample statistics are used directly in place of the corresponding population indicators. Because many different statistics are available, and because sample statistics necessarily differ from the corresponding population indicators, the properties of the sample statistics must be evaluated so that good statistics are chosen for statistical inference.
Interval estimation is a form of parameter estimation. By sampling from the population and working to given accuracy and precision requirements, an appropriate interval is constructed as an estimate of the range within which the true value of a population parameter, or a function of that parameter, lies. The possible range of the population parameter is expressed as an interval on the numerical axis, referred to as the confidence interval of the interval estimate.
Hypothesis testing is a procedure for supporting or rejecting statistical hypotheses about the studied characteristics of certain objects, phenomena and processes. A statistical hypothesis is an assumption about properties of the population that can be examined on the basis of sample observations; the hypothesis under test concerns statistical relationships and the distributions of characteristic values. For example, the hypothesis that a set of characteristic values is normally distributed is a statistical hypothesis; in social studies, hypotheses of identity between two characteristic distributions, of equal means and variances, or that a certain object belongs to a certain population are often tested. The statistical hypothesis testing process statistically establishes whether the proposed hypothesis holds.
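A minimal sketch of interval estimation as described above follows: a large-sample 95% confidence interval for a population mean, computed as the point estimate plus or minus 1.96 times the standard error; the confidence level, the normal approximation and the class name are illustrative choices.

    // Illustrative sketch of interval estimation: a large-sample 95% confidence
    // interval for the population mean, x-bar +/- z * s / sqrt(n), with z = 1.96.
    public class IntervalEstimationSketch {

        public static double[] confidenceInterval95(double[] sample) {
            int n = sample.length;
            double mean = 0;
            for (double v : sample) mean += v;
            mean /= n;                                   // point estimate of the population mean
            double variance = 0;
            for (double v : sample) variance += (v - mean) * (v - mean);
            double s = Math.sqrt(variance / (n - 1));    // sample standard deviation
            double margin = 1.96 * s / Math.sqrt(n);     // 95% margin of error (normal approximation)
            return new double[] { mean - margin, mean + margin };
        }
    }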
The beneficial effect of this embodiment is as follows: data understanding and data preparation are completed, and descriptive information is gathered through exploratory data analysis.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A big data platform BI analysis method is characterized in that: the method comprises the following steps:
s1, constructing a data extraction component, and extracting data in the database or the text document by using the data extraction component;
s2, analyzing the target and the requirement according to the collected data, and creating a data model table structure;
s3, carrying out data cleaning and data transformation processing on the collected data, wherein the data cleaning comprises missing data processing, error classification identification, center and scatter measurement and outlier identification;
s4, describing potential patterns and trends of data through description algorithms, classifying the data by using classification algorithms, classifying a group of data into a plurality of categories according to the similarity and difference of the data by using clustering algorithms, generating association or correlation among data items by using association rules, and estimating and predicting by using statistical algorithms for estimation and prediction.
2. The big data platform BI analysis method of claim 1, wherein: the data extraction component in the S1 comprises a general relational database extraction component and a general text extraction component;
the construction method of the general relational database extraction component comprises the following steps:
s101, constructing a database connection configuration object;
s102, constructing a database connection operation interface;
s103, constructing a database basic information query interface;
s104, querying data information by using a JPA technology;
the construction method of the universal text extraction component comprises the following steps:
s201, constructing a text connection configuration object;
s202, constructing a text connection operation interface;
s203, constructing a text information reading interface.
3. The big data platform BI analysis method of claim 1, wherein: the missing data processing in S3 specifically includes the following steps:
s301, replacing a missing value by inserting a mean value or inserting classification information according to proportion under the condition of missing recorded data;
s302, when the current record is associated with a plurality of table data, the associated table data information is counted, the associated data type, the data amount and the percentage of each classification in the table data information are obtained, the mean range of the missing data is calculated, and the missing value is inserted according to the mean range, the percentage and the time variable of the missing data.
4. The big data platform BI analysis method of claim 3, wherein: the identifying the misclassification comprises the steps of:
s401, identifying classification information through a classificationUtil classification identification method, a classification table or a classification dictionary data identification method;
s402, respectively calculating the quoting condition of each category of information in the data according to the quoting classification information of the data samples in each data table, namely obtaining the quoting information of each classification in the whole data;
and S403, classifying the total data of all the data samples into a small number of classification information under the condition that the total data of all the classification reference data is lower than 1%.
5. The big data platform BI analysis method of claim 4, wherein: the classificationUtil classification and identification method comprises the following steps:
s501, acquiring all data information of a sample table;
s502, calculating data in each line of data in a hashMap key value pair mode, and automatically counting and adding one in a hashMap set, so that a repeated value of each line of data is obtained after all data are calculated;
and S503, when the repetition value is lower than 30, acquiring the classification data information in the high-frequency short data sample and normalizing the classification data information into classification information.
6. The big data platform BI analysis method of claim 5, wherein: the classification table or classification dictionary data identification method comprises the following steps:
s601, acquiring all data of a classification table or a dictionary, and identifying classification names, classification numbers, classification codes and paths in the data;
s602, obtaining repeated data citation according to a classificationUtil classification identification method, and analyzing a classification number or a classification code and a higher classification number or a classification code according to a data counting value;
s603, for a single data dictionary or classification table, a single upper-level code corresponds to a plurality of lower-level codes, and the coded data is frequently referred in other data tables, namely, the coded data can be analyzed into single-type classification information or classification tables.
7. The big data platform BI analysis method of claim 4, wherein: the center and spread metrics include the steps of:
s701, acquiring all data information of the form, and calculating a data value range of the field type according to digital type data in the data information;
s702, acquiring the repeated occurrence times of different data by using a hashMap object according to the data value range and the repeated occurrence times of different data, and calculating the repeated times of each segment of data in a segmented manner according to the total number of the acquired values;
s703, acquiring a main central point and measurement information, namely an average value number and a median, according to the repetition times of the segmented data;
and S704, deducing the user interest points according to the conditions of time, mean value and data range when the mean value number and the median are taken as standards and data sample data of the same type are counted.
8. The big data platform BI analysis method of claim 7, wherein: the data transformation in the S3 comprises min-max normalization and Z-score normalization;
the min-max normalized working mode is as follows: observing a difference between the field value and the minimum value and scaling the difference by a range;
the working mode of the Z-score standardization is as follows: the difference between the field value and the field mean is captured and scaled by the standard deviation SD of the field value.
9. The big data platform BI analysis method of claim 1, wherein: the statistical algorithms for estimation and prediction in S4 include a point estimation method, an interval estimation method, and a hypothesis verification method.
CN201911066534.3A 2019-11-04 2019-11-04 Big data platform BI analysis method Active CN110990384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911066534.3A CN110990384B (en) 2019-11-04 2019-11-04 Big data platform BI analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911066534.3A CN110990384B (en) 2019-11-04 2019-11-04 Big data platform BI analysis method

Publications (2)

Publication Number Publication Date
CN110990384A true CN110990384A (en) 2020-04-10
CN110990384B CN110990384B (en) 2023-08-22

Family

ID=70082982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911066534.3A Active CN110990384B (en) 2019-11-04 2019-11-04 Big data platform BI analysis method

Country Status (1)

Country Link
CN (1) CN110990384B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060136462A1 (en) * 2004-12-16 2006-06-22 Campos Marcos M Data-centric automatic data mining
CN102135979A (en) * 2010-12-08 2011-07-27 华为技术有限公司 Data cleaning method and device
CN103761311A (en) * 2014-01-23 2014-04-30 中国矿业大学 Sentiment classification method based on multi-source field instance migration
US20180189664A1 (en) * 2015-06-26 2018-07-05 National University Of Ireland, Galway Data analysis and event detection method and system
US20170192975A1 (en) * 2015-12-31 2017-07-06 Ebay Inc. System and method for identifying miscategorization
CN106022477A (en) * 2016-05-18 2016-10-12 国网信通亿力科技有限责任公司 Intelligent analysis decision system and method
CN106776703A (en) * 2016-11-15 2017-05-31 上海汉邦京泰数码技术有限公司 A kind of multivariate data cleaning technique under virtualized environment
US20180218070A1 (en) * 2017-02-01 2018-08-02 Wipro Limited System and method of data cleansing for improved data classification
CN107103050A (en) * 2017-03-31 2017-08-29 海通安恒(大连)大数据科技有限公司 A kind of big data Modeling Platform and method
US20180322096A1 (en) * 2017-05-05 2018-11-08 Han-Wei Zhang Data analysis system and analysis method therefor
CN109739850A (en) * 2019-01-11 2019-05-10 安徽爱吉泰克科技有限公司 A kind of archives big data intellectual analysis cleaning digging system
CN110120040A (en) * 2019-05-13 2019-08-13 广州锟元方青医疗科技有限公司 Sectioning image processing method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
殷复莲, pages: 32 - 34 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579581A (en) * 2020-11-30 2021-03-30 贵州力创科技发展有限公司 Data access method and system of data analysis engine
CN112579581B (en) * 2020-11-30 2023-04-14 贵州力创科技发展有限公司 Data access method and system of data analysis engine
CN114466393A (en) * 2022-04-13 2022-05-10 深圳市永达电子信息股份有限公司 Rail transit vehicle-ground communication potential risk monitoring method and system
CN114466393B (en) * 2022-04-13 2022-07-12 深圳市永达电子信息股份有限公司 Rail transit vehicle-ground communication potential risk monitoring method and system

Also Published As

Publication number Publication date
CN110990384B (en) 2023-08-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant